diff mbox series

[7/7] exec: Implement kernel_execve

Message ID 87wo365ikj.fsf@x220.int.ebiederm.org (mailing list archive)
State New, archived
Headers show
Series Implementing kernel_execve | expand

Commit Message

Eric W. Biederman July 14, 2020, 1:31 p.m. UTC
To allow the kernel not to play games with set_fs to call exec
implement kernel_execve.  The function kernel_execve takes pointers
into kernel memory and copies the values pointed to onto the new
userspace stack.

The calls with arguments from kernel space of do_execve are replaced
with calls to kernel_execve.

The calls do_execve and do_execveat are made static as there are now
no callers outside of exec.

The comments that mention do_execve are updated to refer to
kernel_execve or execve depending on the circumstances.  In addition
to correcting the comments, this makes it easy to grep for do_execve
and verify it is not used.

Inspired-by: https://lkml.kernel.org/r/20200627072704.2447163-1-hch@lst.de
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 arch/x86/entry/entry_32.S      |  2 +-
 arch/x86/entry/entry_64.S      |  2 +-
 arch/x86/kernel/unwind_frame.c |  2 +-
 fs/exec.c                      | 88 +++++++++++++++++++++++++++++++++-
 include/linux/binfmts.h        |  9 +---
 init/main.c                    |  4 +-
 kernel/umh.c                   |  6 +--
 security/tomoyo/common.h       |  2 +-
 security/tomoyo/domain.c       |  4 +-
 security/tomoyo/tomoyo.c       |  4 +-
 10 files changed, 100 insertions(+), 23 deletions(-)

Comments

Kees Cook July 14, 2020, 9:49 p.m. UTC | #1
On Tue, Jul 14, 2020 at 08:31:40AM -0500, Eric W. Biederman wrote:
> +static int count_strings_kernel(const char *const *argv)
> +{
> +	int i;
> +
> +	if (!argv)
> +		return 0;
> +
> +	for (i = 0; argv[i]; ++i) {
> +		if (i >= MAX_ARG_STRINGS)
> +			return -E2BIG;
> +		if (fatal_signal_pending(current))
> +			return -ERESTARTNOHAND;
> +		cond_resched();
> +	}
> +	return i;
> +}

I notice count() is only ever called with MAX_ARG_STRINGS. Perhaps
refactor that too? (And maybe rename it to count_strings_user()?)

Otherwise, looks good:

Reviewed-by: Kees Cook <keescook@chromium.org>
Christoph Hellwig July 15, 2020, 6:42 a.m. UTC | #2
> +static int count_strings_kernel(const char *const *argv)
> +{
> +	int i;
> +
> +	if (!argv)
> +		return 0;
> +
> +	for (i = 0; argv[i]; ++i) {
> +		if (i >= MAX_ARG_STRINGS)
> +			return -E2BIG;
> +		if (fatal_signal_pending(current))
> +			return -ERESTARTNOHAND;
> +		cond_resched();

I don't think we need a fatal_signal_pending and cond_resched() is
needed in each step given that we don't actually do anything.
Christoph Hellwig July 15, 2020, 6:42 a.m. UTC | #3
On Tue, Jul 14, 2020 at 02:49:23PM -0700, Kees Cook wrote:
> On Tue, Jul 14, 2020 at 08:31:40AM -0500, Eric W. Biederman wrote:
> > +static int count_strings_kernel(const char *const *argv)
> > +{
> > +	int i;
> > +
> > +	if (!argv)
> > +		return 0;
> > +
> > +	for (i = 0; argv[i]; ++i) {
> > +		if (i >= MAX_ARG_STRINGS)
> > +			return -E2BIG;
> > +		if (fatal_signal_pending(current))
> > +			return -ERESTARTNOHAND;
> > +		cond_resched();
> > +	}
> > +	return i;
> > +}
> 
> I notice count() is only ever called with MAX_ARG_STRINGS. Perhaps
> refactor that too? (And maybe rename it to count_strings_user()?)

Liks this?

http://git.infradead.org/users/hch/misc.git/commitdiff/35a3129dab5b712b018c30681d15de42d9509731
David Laight July 15, 2020, 2:55 p.m. UTC | #4
From: Christoph Hellwig
> Sent: 15 July 2020 07:43
> Subject: Re: [PATCH 7/7] exec: Implement kernel_execve
> 
> On Tue, Jul 14, 2020 at 02:49:23PM -0700, Kees Cook wrote:
> > On Tue, Jul 14, 2020 at 08:31:40AM -0500, Eric W. Biederman wrote:
> > > +static int count_strings_kernel(const char *const *argv)
> > > +{
> > > +	int i;
> > > +
> > > +	if (!argv)
> > > +		return 0;
> > > +
> > > +	for (i = 0; argv[i]; ++i) {
> > > +		if (i >= MAX_ARG_STRINGS)
> > > +			return -E2BIG;
> > > +		if (fatal_signal_pending(current))
> > > +			return -ERESTARTNOHAND;
> > > +		cond_resched();
> > > +	}
> > > +	return i;
> > > +}
> >
> > I notice count() is only ever called with MAX_ARG_STRINGS. Perhaps
> > refactor that too? (And maybe rename it to count_strings_user()?)

Thinks....
If you setup env[] and argv[] on the new user stack early in exec processing
then you may not need any limits at all - except the size of the user stack.
Even the get_user() loop will hit an invalid address before the counter
wraps (provided it is unsigned long).

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
Kees Cook July 15, 2020, 3 p.m. UTC | #5
On Wed, Jul 15, 2020 at 07:42:48AM +0100, Christoph Hellwig wrote:
> On Tue, Jul 14, 2020 at 02:49:23PM -0700, Kees Cook wrote:
> > On Tue, Jul 14, 2020 at 08:31:40AM -0500, Eric W. Biederman wrote:
> > > +static int count_strings_kernel(const char *const *argv)
> > > +{
> > > +	int i;
> > > +
> > > +	if (!argv)
> > > +		return 0;
> > > +
> > > +	for (i = 0; argv[i]; ++i) {
> > > +		if (i >= MAX_ARG_STRINGS)
> > > +			return -E2BIG;
> > > +		if (fatal_signal_pending(current))
> > > +			return -ERESTARTNOHAND;
> > > +		cond_resched();
> > > +	}
> > > +	return i;
> > > +}
> > 
> > I notice count() is only ever called with MAX_ARG_STRINGS. Perhaps
> > refactor that too? (And maybe rename it to count_strings_user()?)
> 
> Liks this?
> 
> http://git.infradead.org/users/hch/misc.git/commitdiff/35a3129dab5b712b018c30681d15de42d9509731

Heh, yes please. :) (Which branch is this from? Are yours and Eric's
tree going to collide?)
Kees Cook July 15, 2020, 3:09 p.m. UTC | #6
On Wed, Jul 15, 2020 at 02:55:50PM +0000, David Laight wrote:
> From: Christoph Hellwig
> > Sent: 15 July 2020 07:43
> > Subject: Re: [PATCH 7/7] exec: Implement kernel_execve
> > 
> > On Tue, Jul 14, 2020 at 02:49:23PM -0700, Kees Cook wrote:
> > > On Tue, Jul 14, 2020 at 08:31:40AM -0500, Eric W. Biederman wrote:
> > > > +static int count_strings_kernel(const char *const *argv)
> > > > +{
> > > > +	int i;
> > > > +
> > > > +	if (!argv)
> > > > +		return 0;
> > > > +
> > > > +	for (i = 0; argv[i]; ++i) {
> > > > +		if (i >= MAX_ARG_STRINGS)
> > > > +			return -E2BIG;
> > > > +		if (fatal_signal_pending(current))
> > > > +			return -ERESTARTNOHAND;
> > > > +		cond_resched();
> > > > +	}
> > > > +	return i;
> > > > +}
> > >
> > > I notice count() is only ever called with MAX_ARG_STRINGS. Perhaps
> > > refactor that too? (And maybe rename it to count_strings_user()?)
> 
> Thinks....
> If you setup env[] and argv[] on the new user stack early in exec processing
> then you may not need any limits at all - except the size of the user stack.
> Even the get_user() loop will hit an invalid address before the counter
> wraps (provided it is unsigned long).

*grumpy noises* Yes, but not in practice (if I'm understanding what you
mean). The expectations of a number of execution environments can be
really odd-ball. I've tried to collect the notes from over the years in
prepare_arg_pages()'s comments, and it mostly boils down to "there has
to be enough room for the exec to start" otherwise the exec ends up in a
hard-to-debug failure state (i.e. past the "point of no return", where you
get no useful information about the cause of the SEGV). So the point has
been to move as many of the setup checks as early as possible and report
about them if they fail. The argv processing is already very early, but
it needs to do the limit checks otherwise it'll just break after the exec
is underway and the process will just SEGV. (And ... some environments
will attempt to dynamically check the size of the argv space by growing
until it sees E2BIG, so we can't just remove it and let those hit SEGV.)
David Laight July 15, 2020, 4:46 p.m. UTC | #7
From: Kees Cook <keescook@chromium.org>
> Sent: 15 July 2020 16:09
> 
> On Wed, Jul 15, 2020 at 02:55:50PM +0000, David Laight wrote:
> > From: Christoph Hellwig
> > > Sent: 15 July 2020 07:43
> > > Subject: Re: [PATCH 7/7] exec: Implement kernel_execve
> > >
> > > On Tue, Jul 14, 2020 at 02:49:23PM -0700, Kees Cook wrote:
> > > > On Tue, Jul 14, 2020 at 08:31:40AM -0500, Eric W. Biederman wrote:
> > > > > +static int count_strings_kernel(const char *const *argv)
> > > > > +{
> > > > > +	int i;
> > > > > +
> > > > > +	if (!argv)
> > > > > +		return 0;
> > > > > +
> > > > > +	for (i = 0; argv[i]; ++i) {
> > > > > +		if (i >= MAX_ARG_STRINGS)
> > > > > +			return -E2BIG;
> > > > > +		if (fatal_signal_pending(current))
> > > > > +			return -ERESTARTNOHAND;
> > > > > +		cond_resched();
> > > > > +	}
> > > > > +	return i;
> > > > > +}
> > > >
> > > > I notice count() is only ever called with MAX_ARG_STRINGS. Perhaps
> > > > refactor that too? (And maybe rename it to count_strings_user()?)
> >
> > Thinks....
> > If you setup env[] and argv[] on the new user stack early in exec processing
> > then you may not need any limits at all - except the size of the user stack.
> > Even the get_user() loop will hit an invalid address before the counter
> > wraps (provided it is unsigned long).
> 
> *grumpy noises* Yes, but not in practice (if I'm understanding what you
> mean). The expectations of a number of execution environments can be
> really odd-ball. I've tried to collect the notes from over the years in
> prepare_arg_pages()'s comments, and it mostly boils down to "there has
> to be enough room for the exec to start" otherwise the exec ends up in a
> hard-to-debug failure state (i.e. past the "point of no return", where you
> get no useful information about the cause of the SEGV). So the point has
> been to move as many of the setup checks as early as possible and report
> about them if they fail. The argv processing is already very early, but
> it needs to do the limit checks otherwise it'll just break after the exec
> is underway and the process will just SEGV. (And ... some environments
> will attempt to dynamically check the size of the argv space by growing
> until it sees E2BIG, so we can't just remove it and let those hit SEGV.)

Yes - I bet the code is horrid.
I guess the real problem is that you'd need access to the old process's
user addresses and the new process's stack area at the same time.
Unless there is a suitable hole in the old process's address map
any attempted trick will fall foul of cache aliasing on some
architectures - like anything else that does page-loaning.

I'm sure there are hair-brained schemes that might work.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
Christoph Hellwig July 15, 2020, 6:20 p.m. UTC | #8
On Wed, Jul 15, 2020 at 08:00:16AM -0700, Kees Cook wrote:
> Heh, yes please. :) (Which branch is this from?

http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/exec-cleanup

> Are yours and Eric's
> tree going to collide?)

Yes, badly.
Eric W. Biederman July 15, 2020, 6:23 p.m. UTC | #9
Christoph Hellwig <hch@infradead.org> writes:

>> +static int count_strings_kernel(const char *const *argv)
>> +{
>> +	int i;
>> +
>> +	if (!argv)
>> +		return 0;
>> +
>> +	for (i = 0; argv[i]; ++i) {
>> +		if (i >= MAX_ARG_STRINGS)
>> +			return -E2BIG;
>> +		if (fatal_signal_pending(current))
>> +			return -ERESTARTNOHAND;
>> +		cond_resched();
>
> I don't think we need a fatal_signal_pending and cond_resched() is
> needed in each step given that we don't actually do anything.

If we have a MAX_ARG_STRINGS sized argv passed in, that is 2^31
iterations of the loop.  A processor at 2Ghz performs roughly 2^31
cycles per second.  So this loop has the potential to run for an entire
second.  That is long enough to need fatal_signal_pending() and
cond_resched checks.

In practice I don't think we have any argv arrays anywhere near that big
passed in from the kernel.  However removing the logic that accounts for
long running loops is best handled as a separate change so that people
will analyze the patch based on that criterian, and so that in the
highly unlikely even something goes wrong people have a nice commit
to bisect things to.

Eric
diff mbox series

Patch

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index 024d7d276cd4..8f4e085ee06d 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -854,7 +854,7 @@  SYM_CODE_START(ret_from_fork)
 	CALL_NOSPEC ebx
 	/*
 	 * A kernel thread is allowed to return here after successfully
-	 * calling do_execve().  Exit to userspace to complete the execve()
+	 * calling kernel_execve().  Exit to userspace to complete the execve()
 	 * syscall.
 	 */
 	movl	$0, PT_EAX(%esp)
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index d2a00c97e53f..73c7e255256b 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -293,7 +293,7 @@  SYM_CODE_START(ret_from_fork)
 	CALL_NOSPEC rbx
 	/*
 	 * A kernel thread is allowed to return here after successfully
-	 * calling do_execve().  Exit to userspace to complete the execve()
+	 * calling kernel_execve().  Exit to userspace to complete the execve()
 	 * syscall.
 	 */
 	movq	$0, RAX(%rsp)
diff --git a/arch/x86/kernel/unwind_frame.c b/arch/x86/kernel/unwind_frame.c
index 722a85f3b2dd..e40b4942157f 100644
--- a/arch/x86/kernel/unwind_frame.c
+++ b/arch/x86/kernel/unwind_frame.c
@@ -275,7 +275,7 @@  bool unwind_next_frame(struct unwind_state *state)
 		 * This user_mode() check is slightly broader than a PF_KTHREAD
 		 * check because it also catches the awkward situation where a
 		 * newly forked kthread transitions into a user task by calling
-		 * do_execve(), which eventually clears PF_KTHREAD.
+		 * kernel_execve(), which eventually clears PF_KTHREAD.
 		 */
 		if (!user_mode(regs))
 			goto the_end;
diff --git a/fs/exec.c b/fs/exec.c
index f8135dc149b3..3698252719a3 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -448,6 +448,23 @@  static int count(struct user_arg_ptr argv, int max)
 	return i;
 }
 
+static int count_strings_kernel(const char *const *argv)
+{
+	int i;
+
+	if (!argv)
+		return 0;
+
+	for (i = 0; argv[i]; ++i) {
+		if (i >= MAX_ARG_STRINGS)
+			return -E2BIG;
+		if (fatal_signal_pending(current))
+			return -ERESTARTNOHAND;
+		cond_resched();
+	}
+	return i;
+}
+
 static int bprm_stack_limits(struct linux_binprm *bprm)
 {
 	unsigned long limit, ptr_size;
@@ -624,6 +641,20 @@  int copy_string_kernel(const char *arg, struct linux_binprm *bprm)
 }
 EXPORT_SYMBOL(copy_string_kernel);
 
+static int copy_strings_kernel(int argc, const char *const *argv,
+			       struct linux_binprm *bprm)
+{
+	while (argc-- > 0) {
+		int ret = copy_string_kernel(argv[argc], bprm);
+		if (ret < 0)
+			return ret;
+		if (fatal_signal_pending(current))
+			return -ERESTARTNOHAND;
+		cond_resched();
+	}
+	return 0;
+}
+
 #ifdef CONFIG_MMU
 
 /*
@@ -1991,7 +2022,60 @@  static int do_execveat_common(int fd, struct filename *filename,
 	return retval;
 }
 
-int do_execve(struct filename *filename,
+int kernel_execve(const char *kernel_filename,
+		  const char *const *argv, const char *const *envp)
+{
+	struct filename *filename;
+	struct linux_binprm *bprm;
+	int fd = AT_FDCWD;
+	int retval;
+
+	filename = getname_kernel(kernel_filename);
+	if (IS_ERR(filename))
+		return PTR_ERR(filename);
+
+	bprm = alloc_bprm(fd, filename);
+	if (IS_ERR(bprm)) {
+		retval = PTR_ERR(bprm);
+		goto out_ret;
+	}
+
+	retval = count_strings_kernel(argv);
+	if (retval < 0)
+		goto out_free;
+	bprm->argc = retval;
+
+	retval = count_strings_kernel(envp);
+	if (retval < 0)
+		goto out_free;
+	bprm->envc = retval;
+
+	retval = bprm_stack_limits(bprm);
+	if (retval < 0)
+		goto out_free;
+
+	retval = copy_string_kernel(bprm->filename, bprm);
+	if (retval < 0)
+		goto out_free;
+	bprm->exec = bprm->p;
+
+	retval = copy_strings_kernel(bprm->envc, envp, bprm);
+	if (retval < 0)
+		goto out_free;
+
+	retval = copy_strings_kernel(bprm->argc, argv, bprm);
+	if (retval < 0)
+		goto out_free;
+
+	retval = bprm_execve(bprm, fd, filename, 0);
+out_free:
+	free_bprm(bprm);
+out_ret:
+	putname(filename);
+	return retval;
+}
+
+static int do_execve(struct filename *filename,
 	const char __user *const __user *__argv,
 	const char __user *const __user *__envp)
 {
@@ -2000,7 +2084,7 @@  int do_execve(struct filename *filename,
 	return do_execveat_common(AT_FDCWD, filename, argv, envp, 0);
 }
 
-int do_execveat(int fd, struct filename *filename,
+static int do_execveat(int fd, struct filename *filename,
 		const char __user *const __user *__argv,
 		const char __user *const __user *__envp,
 		int flags)
diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
index 8e9e1b0c8eb8..0571701ab1c5 100644
--- a/include/linux/binfmts.h
+++ b/include/linux/binfmts.h
@@ -135,12 +135,7 @@  int copy_string_kernel(const char *arg, struct linux_binprm *bprm);
 extern void set_binfmt(struct linux_binfmt *new);
 extern ssize_t read_code(struct file *, unsigned long, loff_t, size_t);
 
-extern int do_execve(struct filename *,
-		     const char __user * const __user *,
-		     const char __user * const __user *);
-extern int do_execveat(int, struct filename *,
-		       const char __user * const __user *,
-		       const char __user * const __user *,
-		       int);
+int kernel_execve(const char *filename,
+		  const char *const *argv, const char *const *envp);
 
 #endif /* _LINUX_BINFMTS_H */
diff --git a/init/main.c b/init/main.c
index 0ead83e86b5a..78ccec5c28f3 100644
--- a/init/main.c
+++ b/init/main.c
@@ -1329,9 +1329,7 @@  static int run_init_process(const char *init_filename)
 	pr_debug("  with environment:\n");
 	for (p = envp_init; *p; p++)
 		pr_debug("    %s\n", *p);
-	return do_execve(getname_kernel(init_filename),
-		(const char __user *const __user *)argv_init,
-		(const char __user *const __user *)envp_init);
+	return kernel_execve(init_filename, argv_init, envp_init);
 }
 
 static int try_to_run_init_process(const char *init_filename)
diff --git a/kernel/umh.c b/kernel/umh.c
index 6ca2096298b9..a25433f9cd9a 100644
--- a/kernel/umh.c
+++ b/kernel/umh.c
@@ -98,9 +98,9 @@  static int call_usermodehelper_exec_async(void *data)
 
 	commit_creds(new);
 
-	retval = do_execve(getname_kernel(sub_info->path),
-			   (const char __user *const __user *)sub_info->argv,
-			   (const char __user *const __user *)sub_info->envp);
+	retval = kernel_execve(sub_info->path,
+			       (const char *const *)sub_info->argv,
+			       (const char *const *)sub_info->envp);
 out:
 	sub_info->retval = retval;
 	/*
diff --git a/security/tomoyo/common.h b/security/tomoyo/common.h
index 050473df5809..85246b9df7ca 100644
--- a/security/tomoyo/common.h
+++ b/security/tomoyo/common.h
@@ -425,7 +425,7 @@  struct tomoyo_request_info {
 	struct tomoyo_obj_info *obj;
 	/*
 	 * For holding parameters specific to execve() request.
-	 * NULL if not dealing do_execve().
+	 * NULL if not dealing execve().
 	 */
 	struct tomoyo_execve *ee;
 	struct tomoyo_domain_info *domain;
diff --git a/security/tomoyo/domain.c b/security/tomoyo/domain.c
index 7869d6a9980b..53b3e1f5f227 100644
--- a/security/tomoyo/domain.c
+++ b/security/tomoyo/domain.c
@@ -767,7 +767,7 @@  int tomoyo_find_next_domain(struct linux_binprm *bprm)
 
 	/*
 	 * Check for domain transition preference if "file execute" matched.
-	 * If preference is given, make do_execve() fail if domain transition
+	 * If preference is given, make execve() fail if domain transition
 	 * has failed, for domain transition preference should be used with
 	 * destination domain defined.
 	 */
@@ -810,7 +810,7 @@  int tomoyo_find_next_domain(struct linux_binprm *bprm)
 		snprintf(ee->tmp, TOMOYO_EXEC_TMPSIZE - 1, "<%s>",
 			 candidate->name);
 		/*
-		 * Make do_execve() fail if domain transition across namespaces
+		 * Make execve() fail if domain transition across namespaces
 		 * has failed.
 		 */
 		reject_on_transition_failure = true;
diff --git a/security/tomoyo/tomoyo.c b/security/tomoyo/tomoyo.c
index f9adddc42ac8..1f3cd432d830 100644
--- a/security/tomoyo/tomoyo.c
+++ b/security/tomoyo/tomoyo.c
@@ -93,7 +93,7 @@  static int tomoyo_bprm_check_security(struct linux_binprm *bprm)
 	struct tomoyo_task *s = tomoyo_task(current);
 
 	/*
-	 * Execute permission is checked against pathname passed to do_execve()
+	 * Execute permission is checked against pathname passed to execve()
 	 * using current domain.
 	 */
 	if (!s->old_domain_info) {
@@ -307,7 +307,7 @@  static int tomoyo_file_fcntl(struct file *file, unsigned int cmd,
  */
 static int tomoyo_file_open(struct file *f)
 {
-	/* Don't check read permission here if called from do_execve(). */
+	/* Don't check read permission here if called from execve(). */
 	if (current->in_execve)
 		return 0;
 	return tomoyo_check_open_permission(tomoyo_domain(), &f->f_path,