diff mbox series

[RFC] exec: Freeze the other threads during a multi-threaded exec

Message ID 87h7tsllgw.fsf@x220.int.ebiederm.org (mailing list archive)
State RFC, archived
Headers show
Series [RFC] exec: Freeze the other threads during a multi-threaded exec | expand

Commit Message

Eric W. Biederman July 27, 2020, 9:03 p.m. UTC
Many of the challenges of implementing a simple version of exec come
from the fact the code handles execing multi-thread processes.

To the best of my knowledge processes with more than one thread
calling exec are not common, and as all of the threads will be killed
by exec there does not appear to be any useful work a thread can
reliably do during exec.

Therefore make it simpler to get exec correct by freezing the other
threads at the beginning of exec.  This removes an entire class of
races, and makes it tractable to fix some of the long standing
issues with exec.

One issue that this change makes it easier to solve is the issue of
deailing with the file table.  Today exec unshares the file table at
the beginning to ensure there are no weird races with file
descriptors.  Unfortunately this unsharing can unshare the file table
when only threads of the current process share it.  Which results in
unnecessary unsharing and posix locks being inappropriately dropped by
a multi-threaded exec.  With all of the threads frozen the thread
count is stable and it is easy to tell if the if the file table really
needs to be unshared by exec.

Further this changes allows seccomp to stop taking cred_guard_mutex,
as the seccomp code takes cred_guard_mutex to protect against another
thread that is in the middle of calling exec and this change
guarantees that if one threads is calling exec all of the other threads
have stopped running.  So this problematic kind of concurrency between
threads can no longer happen.

As this change reuses the generic freezer code interactions with
signal group stop, ptrace stop, the cgroup freezer, fatal signals, and
SIGCONT are already well defined and inherited from the freezer code.
In short other threads will not wake up to participate in signal group
stop, ptrace stop, the cgroup freezer, to handle fatal signals.  As
SIGCONT is handled by the caller that is still processed as usual.
Fatal signals while not processed by other threads are still processed
by the thread calling exec so they take effect as usual.

The code in de_thread was modified to unfreeze the threads at the same
time as it is killing them ensuring that code continues to work as it
does today, and without introducing any races where a thread might
perform any problematic work in the middle of de_thread.

The code in fork is optimized to set TIF_SIGPENDING if the new task
needs to freeze.  This makes it more likely that a new task will
freeze immediately instead of proceeding to userspace to execute some
code, where the next freezing loop will need to tell it to freeze.

A new function exec_freezing is added and called from the refrigerator
so that the freezer code will actually ensure threads called from
exec are frozen as well.

A new function exec_freeze_threads based upon
kernel/power/process.c:try_to_freeze_tasks is added.  To play well
with other uses of the kernel freezer it uses a killable sleep wrapped
with freezer_do_not_count/freezer_count.  So that it will not lock out
the global freezer when it is simply waiting for it's threads to
freeze.  This new function also ensures that only one thread of a
thread group can enter exec at a time.

While this does now allow every process to touch system_freezing_cnt
which is an int.  This should be fine as the maximum number of tasks
is PID_MAX_LIMIT which has an upper bound of 4 * 1024 * 1024.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/exec.c                    | 99 +++++++++++++++++++++++++++++++++++-
 include/linux/sched/signal.h | 10 ++++
 kernel/fork.c                |  3 ++
 kernel/freezer.c             |  2 +-
 4 files changed, 112 insertions(+), 2 deletions(-)

Comments

Linus Torvalds July 28, 2020, 12:20 a.m. UTC | #1
On Mon, Jul 27, 2020 at 2:06 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> Therefore make it simpler to get exec correct by freezing the other
> threads at the beginning of exec.  This removes an entire class of
> races, and makes it tractable to fix some of the long standing
> issues with exec.

I hate the global state part of the freezer.

It's also pointless. We don't want to trigger all the tests that
various random driver kernel threads do.

I also really don't like how now execve() by any random person will
suddenly impact everything that might be doing freezing.

It also makes for a possible _huge_ latency regression for execve(),
since freezing really has never been a very low-latency operation.

Other threads doing IO can now basically block execve() for a long
long long time.

Finally, I think your patch is fundamentally broken for another
reason: it depends on CONFIG_FREEZER, and that isn't even required to
be set!

So no, this is not at all acceptable in that form.

Now, maybe we could _make_ it acceptable, by

 (a) add a per-process freezer count to avoid the global state for this case

 (b)  make a small subset of the freezing code available for the
!CONFIG_FREEZER thing

 (c) fix this "simple freezer" to not actually force wakeups etc, but
catch things in the

but honestly, at that point nothing of the "CONFIG_FREEZER" code even
really exists any more. It would be more of a "execve_synchronize()"
thing, where we'd catch things in the scheduler and/or system call
entry/exit or whatever.

Also, that makes these kinds of nasty hacks that just make the
existign freezer code even harder to figure out:

> A new function exec_freeze_threads based upon
> kernel/power/process.c:try_to_freeze_tasks is added.  To play well
> with other uses of the kernel freezer it uses a killable sleep wrapped
> with freezer_do_not_count/freezer_count.

Ugh. Just _ugly_.

And honestly, completely and utterly broken. See above.

I understand the wish to re-use existing infrastructure. But the fact
is, the FREEZER code is just about the _last_ thing you should want to
use. That, and stop_machine(), is just too much of a big hammer.

                Linus
Aleksa Sarai July 28, 2020, 9:41 a.m. UTC | #2
On 2020-07-27, Eric W. Biederman <ebiederm@xmission.com> wrote:
> To the best of my knowledge processes with more than one thread
> calling exec are not common, and as all of the threads will be killed
> by exec there does not appear to be any useful work a thread can
> reliably do during exec.

Every Go program which calls exec (this includes runc, Docker, LXD,
Kubernetes, et al) fills the niche of "multi-threaded program that calls
exec" -- all Go programs are multi-threaded and there's no way of
disabling this. This will most likely cause pretty bad performance
regression for basically all container workloads.
Eric W. Biederman July 28, 2020, 12:18 p.m. UTC | #3
Aleksa Sarai <cyphar@cyphar.com> writes:

> On 2020-07-27, Eric W. Biederman <ebiederm@xmission.com> wrote:
>> To the best of my knowledge processes with more than one thread
>> calling exec are not common, and as all of the threads will be killed
>> by exec there does not appear to be any useful work a thread can
>> reliably do during exec.
>
> Every Go program which calls exec (this includes runc, Docker, LXD,
> Kubernetes, et al) fills the niche of "multi-threaded program that calls
> exec" -- all Go programs are multi-threaded and there's no way of
> disabling this. This will most likely cause pretty bad performance
> regression for basically all container workloads.

So it is a good point that container runtimes use Go, and that
fundamentally anything that uses Go will be multi-threaded.  Having just
looked closely at this I don't think in practice this is an issue even
at this early state of my code.

If those other threads are sleeping the code I have implemented should
be a no-op.

If those threads aren't sleeping you have undefined behavior, as at
some point the kernel will come through and kill those threads.

Further unless I am completely mistaken the container runtimes use
forkAndExecInChild from go/src/syscall/exec_linux.go which performs a
vfork before performing the exec.

Eric
Eric W. Biederman July 28, 2020, 12:39 p.m. UTC | #4
Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Mon, Jul 27, 2020 at 2:06 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>>
>> Therefore make it simpler to get exec correct by freezing the other
>> threads at the beginning of exec.  This removes an entire class of
>> races, and makes it tractable to fix some of the long standing
>> issues with exec.
>
> I hate the global state part of the freezer.
>
> It's also pointless. We don't want to trigger all the tests that
> various random driver kernel threads do.
>
> I also really don't like how now execve() by any random person will
> suddenly impact everything that might be doing freezing.

Yes.  system_freezing_cnt as an enable/disable that affects all
tasks in the system does seem like a misfeature for use in the context
of exec where only a few tasks need to be dealt with.

> It also makes for a possible _huge_ latency regression for execve(),
> since freezing really has never been a very low-latency operation.
>
> Other threads doing IO can now basically block execve() for a long
> long long time.

Hmm.  Potentially.  The synchronization with the other threads must
happen in a multi-threaded exec in de_thread.

So I need to look at the differences between where de_thread thread
can kill a thread and the freezer can not freeze a thread.  I am hoping
that the freezer has already instrumented most of those sleeps but I
admit I have not looked yet.

> Finally, I think your patch is fundamentally broken for another
> reason: it depends on CONFIG_FREEZER, and that isn't even required to
> be set!

Very true.  I absolutely missed that detail.

> So no, this is not at all acceptable in that form.
>
> Now, maybe we could _make_ it acceptable, by
>
>  (a) add a per-process freezer count to avoid the global state for this case

Perhaps even a single bit.


>  (b)  make a small subset of the freezing code available for the
> !CONFIG_FREEZER thing

The code that is controlled by CONFIG_FREEZER is just kernel/freezer.c,
and include/linux/freezer.h.  Which is 177 + 303 lines respectively
so not much.

Or are you thinking about all of the locations that already include
freezable sleeps?

>  (c) fix this "simple freezer" to not actually force wakeups etc, but
> catch things in the

To catch things in the scheduler I presume?

The thing is the freezer code does not wake up anything if it is in a
sleep that it has wrapped.  Which I thought was just about all
significant ones but I need to verify that.

> but honestly, at that point nothing of the "CONFIG_FREEZER" code even
> really exists any more. It would be more of a "execve_synchronize()"
> thing, where we'd catch things in the scheduler and/or system call
> entry/exit or whatever.

Yes.  Where we catch things seems to be key.  I believe if all sleeps
that are killable plus system call exit should be enough, to be a noop.
As those are the places where the code can be killed now.

The tricky part is to mark processes that are sleeping in such a way
that then they wake up they go into a slow path and they get trapped
by the freezing code.

> Also, that makes these kinds of nasty hacks that just make the
> existign freezer code even harder to figure out:


>> A new function exec_freeze_threads based upon
>> kernel/power/process.c:try_to_freeze_tasks is added.  To play well
>> with other uses of the kernel freezer it uses a killable sleep wrapped
>> with freezer_do_not_count/freezer_count.
>
> Ugh. Just _ugly_.
>
> And honestly, completely and utterly broken. See above.
>
> I understand the wish to re-use existing infrastructure. But the fact
> is, the FREEZER code is just about the _last_ thing you should want to
> use. That, and stop_machine(), is just too much of a big hammer.

Part of my challenge is that it the more layers that get put around a
sleep the trickier it is to subsequently wrap.

I can see the point of building something very simple and more
fundamental that doesn't need as much support as the current freezer.
Perhaps something the freezer can the be rebuilt upon.


If the basic idea of stopping other threads early before we kill them in
exec sounds plausible then I will see what I can do.

Eric
Eric W. Biederman July 28, 2020, 1:20 p.m. UTC | #5
ebiederm@xmission.com (Eric W. Biederman) writes:

> Linus Torvalds <torvalds@linux-foundation.org> writes:
>
>> It also makes for a possible _huge_ latency regression for execve(),
>> since freezing really has never been a very low-latency operation.
>>
>> Other threads doing IO can now basically block execve() for a long
>> long long time.
>
> Hmm.  Potentially.  The synchronization with the other threads must
> happen in a multi-threaded exec in de_thread.
>
> So I need to look at the differences between where de_thread thread
> can kill a thread and the freezer can not freeze a thread.  I am hoping
> that the freezer has already instrumented most of those sleeps but I
> admit I have not looked yet.

Alright I have looked at the freezer a bit more and I now see that the
point of marking things freezable is for kernel threads rather that user
space threads.  I think there are 5 maybe 6 places the code sleeps
reachable by userspace threads that are marked as freezable and most
of those are callable from get_signal.

For exec all I care about are user space threads.  So it appears the
freezer infrastructure adds very little.

Now to see if I can find another way to divert a task into a slow path
as it wakes up, so I don't need to manually wrap all of the sleeping
calls.  Something that plays nice with the scheduler.

Eric
Linus Torvalds July 28, 2020, 6:17 p.m. UTC | #6
On Tue, Jul 28, 2020 at 6:23 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> For exec all I care about are user space threads.  So it appears the
> freezer infrastructure adds very little.

Yeah. 99% of the freezer stuff is for just adding the magic notations
for kernel threads, and for any user space threads it seems the wrong
interface.

> Now to see if I can find another way to divert a task into a slow path
> as it wakes up, so I don't need to manually wrap all of the sleeping
> calls.  Something that plays nice with the scheduler.

The thing is, how many places really care?

Because I think there are like five of them. And they are all marked
by taking cred_guard_mutex, or the file table lock.

So it seems really excessive to then create some whole new "let's
serialize every thread", when you actually don't care about any of it,
except for a couple of very very special cases.

If you care about "thread count stable", you care about exit() and
clone().  You don't care about threads that are happily running - or
sleeping - doing their own thing.

So trying to catch those threads and freezing them really feels like
entirely the wrong interface.

             Linus
Eric W. Biederman July 30, 2020, 1:16 p.m. UTC | #7
Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Tue, Jul 28, 2020 at 6:23 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
>>
>> For exec all I care about are user space threads.  So it appears the
>> freezer infrastructure adds very little.
>
> Yeah. 99% of the freezer stuff is for just adding the magic notations
> for kernel threads, and for any user space threads it seems the wrong
> interface.
>
>> Now to see if I can find another way to divert a task into a slow path
>> as it wakes up, so I don't need to manually wrap all of the sleeping
>> calls.  Something that plays nice with the scheduler.
>
> The thing is, how many places really care?
>
> Because I think there are like five of them. And they are all marked
> by taking cred_guard_mutex, or the file table lock.
>
> So it seems really excessive to then create some whole new "let's
> serialize every thread", when you actually don't care about any of it,
> except for a couple of very very special cases.
>
> If you care about "thread count stable", you care about exit() and
> clone().  You don't care about threads that are happily running - or
> sleeping - doing their own thing.
>
> So trying to catch those threads and freezing them really feels like
> entirely the wrong interface.

For me stopping the other threads has been a conceptually simple
direction that needs exploration even if it doesn't work out.

On that note I have something that is just a little longer than the
patch I posted that doesn't use the freezer, and leans on the fact that
TASK_INTERRUPTBLE and TASK_KILLABLE can occassionaly expect a spurious
wake up.  Which means (with the right locking) those two states
can be transformed into TASK_WAKEKILL to keep sleeping processes
sleeping.

After that I only need one small test in get_signal to catch the
unlikely case of processes running in userspace.

I have not figured out TASK_STOPPED or TASK_TRACED yet as they do
not handle spurious wake ups.

So I think I can stop and keep stopped the other threads in the group
without too much code or complexity for other parts of the kernel
(assuming spurious wakes ups are allowed).




Even with the other threads stopped the code does not simplify as
much as I had hoped.  The code still has to deal with the existence
of the other threads.  So while races don't happen and thus the code
is easier to understand and make correct the actual work of walking
the threads making a count etc still remains.



The real sticky widget right now is the unshare_files call.  When Al
moved the call in fd8328be874f ("[PATCH] sanitize handling of shared
descriptor tables in failing execve()") it introduced a regression that
causes posix file locks to be improperly removed during exec.
Unfortunately no one noticed for about a decade.

What caused the regression is that unshare_files is a noop if the usage
count is 1.  Which means that after de_thread unshare_files only does
something if our file table is shared by another process.  However when
unshare_files is called before de_thread in a multi-threaded process
unshare_files will always perform work.

For the locks that set fl_owner to current->files unsharing
current->files when unnecessary already causes problems, as we now
have an unnecessary change of owner during exec.

After the unnecessary change in owner the existing locks are
eventually removed at the end of bprm_execve:
    bprm_execve()
       if (displaced)
           put_files_struct(displaced)
              filp_close()
                 locks_remove_posix()
                    /* Which removes the posix locks */



After 12 years moving unshare_files back where it used to be is
also problematic, as with the addition of execveat we grew
a clear dependency other threads not messing with our open files:

	/*
	 * Record that a name derived from an O_CLOEXEC fd will be
	 * inaccessible after exec. Relies on having exclusive access to
	 * current->files (due to unshare_files above).
	 */
	if (bprm->fdpath &&
	    close_on_exec(fd, rcu_dereference_raw(current->files->fdt)))
		bprm->interp_flags |= BINPRM_FLAGS_PATH_INACCESSIBLE;


I have made a cursory look and I don't expect any other issues as the
only code in exec that messes with file descriptors is in fs/exec.c
Everything else simply uses "struct file *".


Testing to see if the file descriptor is close_on_exec is just there to
prevent trying to run a script that through a close_on_exec file
descriptor, that is part of the path to the script.  So it is a quality
of implementation issue so the code does not fail later if userspace
tries something silly.

So I think we can safely just update the comment and say if userspace
messes with the file descriptor they pass to exec during exec userspace
can keep both pieces, as it is a self inflicted problem.

All of the other issues I see with reverting the location where the file
table is unshared also look like userspace shooting themselves in the
foot and not a problem with correctness of kernel code.

Which is a long way of saying that I have just convinced myself to
effectively revert fd8328be874f ("[PATCH] sanitize handling of shared
descriptor tables in failing execve()") 

A memory allocation failure after the point of no return is the only
reason to avoid doing this, and since the unshare is expected to
be a noop most of the time that doesn't look like a downside either.


Assuming I don't find anything else I think I will kick down the road
a bit the problem of stopping other threads during exec.  I can handle
unshare_files and seccomp without it.  There still might be something
I can not solve another way, but until I find it I will put this on the
back burner.

Eric
diff mbox series

Patch

diff --git a/fs/exec.c b/fs/exec.c
index 3698252719a3..cce6b700c3bb 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -62,6 +62,7 @@ 
 #include <linux/oom.h>
 #include <linux/compat.h>
 #include <linux/vmalloc.h>
+#include <linux/freezer.h>
 
 #include <linux/uaccess.h>
 #include <asm/mmu_context.h>
@@ -1145,6 +1146,94 @@  static int exec_mmap(struct mm_struct *mm)
 	return 0;
 }
 
+static void unfreeze_threads(struct signal_struct *signal)
+{
+	if (signal->group_execing_task) {
+		signal->group_execing_task = NULL;
+		atomic_dec(&system_freezing_cnt);
+	}
+}
+
+/*
+ * Freeze all other threads in the group for exec
+ */
+static int exec_freeze_threads(void)
+{
+	struct task_struct *p = current, *t;
+	struct signal_struct *signal = p->signal;
+	spinlock_t *lock = &p->sighand->siglock;
+	unsigned long sleep = msecs_to_jiffies(1);
+	unsigned int todo;
+	int ret;
+
+	if (thread_group_empty(p))
+		return 0;
+
+	spin_lock_irq(lock);
+	if (signal_pending(p) || signal_group_exit(signal) ||
+	    signal->group_execing_task) {
+		spin_unlock(lock);
+		return -ERESTARTNOINTR;
+	}
+
+	signal->group_execing_task = p;
+	atomic_inc(&system_freezing_cnt);
+	for (;;) {
+		todo = 0;
+		__for_each_thread(signal, t) {
+			if ((t == p) || !freeze_task(p))
+				continue;
+
+			if (!freezer_should_skip(p))
+				todo++;
+		}
+
+		if (!todo || __fatal_signal_pending(p) ||
+		    (sleep > msecs_to_jiffies(8)))
+			break;
+
+		/*
+		 * We need to retry, but first give the freezing tasks some
+		 * time to enter the refrigerator.  Start with an initial
+		 * 1 ms sleep followed by exponential backoff until 8 ms.
+		 */
+		spin_unlock_irq(lock);
+
+		freezer_do_not_count();
+		schedule_timeout_killable(sleep);
+		freezer_count();
+		sleep *= 2;
+
+		spin_lock_irq(lock);
+	}
+	ret = 0;
+	if (todo)
+		ret = -EBUSY;
+	if (__fatal_signal_pending(p))
+		ret = -ERESTARTNOINTR;
+	if (ret)
+		unfreeze_threads(signal);
+	spin_unlock_irq(lock);
+	return ret;
+}
+
+static void exec_thaw_threads(void)
+{
+	struct task_struct *p = current, *t;
+	struct signal_struct *signal = p->signal;
+	spinlock_t *lock = &p->sighand->siglock;
+
+	spin_lock_irq(lock);
+	if (signal->group_execing_task) {
+		unfreeze_threads(signal);
+		__for_each_thread(signal, t) {
+			if (t != p)
+				__thaw_task(t);
+		}
+	}
+	spin_unlock_irq(lock);
+}
+
 static int de_thread(struct task_struct *tsk)
 {
 	struct signal_struct *sig = tsk->signal;
@@ -1167,6 +1256,7 @@  static int de_thread(struct task_struct *tsk)
 		return -EAGAIN;
 	}
 
+	unfreeze_threads(sig);
 	sig->group_exit_task = tsk;
 	sig->notify_count = zap_other_threads(tsk);
 	if (!thread_group_leader(tsk))
@@ -1885,10 +1975,15 @@  static int bprm_execve(struct linux_binprm *bprm,
 	struct files_struct *displaced;
 	int retval;
 
-	retval = unshare_files(&displaced);
+	/* If the process is multi-threaded stop the other threads */
+	retval = exec_freeze_threads();
 	if (retval)
 		return retval;
 
+	retval = unshare_files(&displaced);
+	if (retval)
+		goto out_ret;
+
 	retval = prepare_bprm_creds(bprm);
 	if (retval)
 		goto out_files;
@@ -1949,6 +2044,8 @@  static int bprm_execve(struct linux_binprm *bprm,
 out_files:
 	if (displaced)
 		reset_files_struct(displaced);
+out_ret:
+	exec_thaw_threads();
 
 	return retval;
 }
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 1bad18a1d8ba..bbf53fcd913b 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -106,6 +106,9 @@  struct signal_struct {
 	int			notify_count;
 	struct task_struct	*group_exit_task;
 
+	/* Task that is performing exec */
+	struct task_struct	*group_execing_task;
+
 	/* thread group stop support, overloads group_exit_code too */
 	int			group_stop_count;
 	unsigned int		flags; /* see SIGNAL_* flags below */
@@ -269,6 +272,13 @@  static inline int signal_group_exit(const struct signal_struct *sig)
 		(sig->group_exit_task != NULL);
 }
 
+static inline bool exec_freezing(struct task_struct *p)
+{
+	struct task_struct *execing = READ_ONCE(p->signal->group_execing_task);
+
+	return execing && (execing != p);
+}
+
 extern void flush_signals(struct task_struct *);
 extern void ignore_signals(struct task_struct *);
 extern void flush_signal_handlers(struct task_struct *, int force_default);
diff --git a/kernel/fork.c b/kernel/fork.c
index bf215af7a904..d3a0f914231c 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2299,6 +2299,9 @@  static __latent_entropy struct task_struct *copy_process(
 	syscall_tracepoint_update(p);
 	write_unlock_irq(&tasklist_lock);
 
+	/* Ensure a process that should freeze exits on the slow path */
+	if (freezing(p))
+		set_tsk_thread_flag(p, TIF_SIGPENDING);
 	proc_fork_connector(p);
 	cgroup_post_fork(p, args);
 	perf_event_fork(p);
diff --git a/kernel/freezer.c b/kernel/freezer.c
index dc520f01f99d..97c6f69b832e 100644
--- a/kernel/freezer.c
+++ b/kernel/freezer.c
@@ -42,7 +42,7 @@  bool freezing_slow_path(struct task_struct *p)
 	if (test_tsk_thread_flag(p, TIF_MEMDIE))
 		return false;
 
-	if (pm_nosig_freezing || cgroup_freezing(p))
+	if (pm_nosig_freezing || cgroup_freezing(p) || exec_freezing(p))
 		return true;
 
 	if (pm_freezing && !(p->flags & PF_KTHREAD))