diff mbox series

[RFC] exec: Conceal the other threads from wakeups during exec

Message ID 87pn8c1uj6.fsf_-_@x220.int.ebiederm.org (mailing list archive)
State New, archived
Headers show
Series [RFC] exec: Conceal the other threads from wakeups during exec | expand

Commit Message

Eric W. Biederman July 30, 2020, 10:56 p.m. UTC
Right now I think I see solutions to the problems of exec without
using this code.

However this code is tested and working for the common cases (and has no
lockdep warnings) and the techniques it is using could potentially be
used to simplify the freezer, the cgroup v1 freezer, or the cgroup v2
freezer.

The key is the function make_task_wakekill which could probably
benefit from a little more review and refinement but appears to
be basically correct.

---
[This change requires more work to handle TASK_STOPPED and TASK_TRACED]
[This adds a new lock ordering dependency siglock -> pi_lock -> rq_lock ]

Many of the challenges of implementing a simple version of exec come
from the fact the code handles exec'ing multi-thread processes.

To the best of my knowledge processes with more than one thread
calling exec are not common, and as all of the threads will be killed
by exec there does not appear to be any useful work a thread can
reliably do during exec.

Therefore make it simpler to get exec correct by concealing the other
threads from wakeups at the beginning of exec.  This removes an entire
class of races, and makes it tractable to fix some of the long
standing issues with exec.

One issue that this change makes it easier to solve is the issue of
deailing with the file table.  Today exec unshares the file table at
the beginning to ensure there are no weird races with file
descriptors.  Unfortunately this unsharing can unshare the file table
when only threads of the current process share it.  Which results in
unnecessary unsharing and posix locks being inappropriately dropped by
a multi-threaded exec.  With all of the threads frozen the thread
count is stable and it is easy to tell if the if the file table really
needs to be unshared by exec.

Further this changes allows seccomp to stop taking cred_guard_mutex,
as the seccomp code takes cred_guard_mutex to protect against another
thread that is in the middle of calling exec and this change
guarantees that if one threads is calling exec all of the other threads
have stopped running.  So this problematic kind of concurrency between
threads can no longer happen.

The code in de_thread was modified to unmask the threads at the same
time as it is killing them ensuring that code continues to work as it
does today, and without introducing any races where a thread might
perform any problematic work in the middle of de_thread.

The code in fork is modified to fail if another thread in the parent
is in the middle of fork.

A new generic scheduler function make_task_killable is added.
I think the locking is ok, changing task->state under
both pi_lock and the rq_lock, but I have not done a detailed
looked yet to confirm that I am not missing something subtle.

A new function exec_conceal_threads is added, to set
group_execing_task and walk through all of the threads and change
their state to TASK_WAKEKILL if it is not already.

A new companion function exec_reveal_threads sends a wake up to all of
the other threads and clear group_execing_task.  This may cause a
spuroius wake up but that is an uncommon case and the code for
TASK_UNINTERRUPTIBLE and TASK_INTERRUPTIBLE is expected to be handle
spurious so it should be fine.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/exec.c                    | 98 +++++++++++++++++++++++++++++++++++-
 include/linux/sched.h        |  2 +
 include/linux/sched/signal.h |  3 ++
 kernel/fork.c                |  6 +++
 kernel/sched/core.c          | 38 ++++++++++++++
 kernel/signal.c              | 11 ++++
 6 files changed, 157 insertions(+), 1 deletion(-)

Comments

Linus Torvalds July 30, 2020, 11:17 p.m. UTC | #1
On Thu, Jul 30, 2020 at 4:00 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> The key is the function make_task_wakekill which could probably
> benefit from a little more review and refinement but appears to
> be basically correct.

You really need to explain a lot more why you think this is all a good idea.

For example, what if one of those other threads is waiting in line for
a critical lock, and the wait-queue you basically disabled was the
exclusive wait after lock handoff?

That means that the lock will now effectively be held by that thread.
No, it wasn't woken up, but it had the lock handed to it, and it's now
entirely unresponsive until it is killed.

How is that different from the deadlocks you're actually trying to fix?

These are the kinds of problems that the freezer() code had too, with
freezing things that held locks etc.

This approach does seem better than the freezer thing, and if I read
it right it will gather things in the signal handler code, but it's
not obvious that gathering them in random places where they sleep for
random reasons is safe or a good idea.

I can imagine _so_ many dead systems if you just basically froze
something that holds the mmap lock and is sleeping on a page fault,
for example.

Maybe I'm missing something, but I really think your "let's freeze
things" is seriously misguided. You're concentrating on some small
problem and trying to solve that, and not seeign the HUGE HONKING
problems that your approach is fundamentally introducing.

              Linus
Oleg Nesterov July 31, 2020, 6:28 a.m. UTC | #2
Eric, I won't comment the intent, but I too do not understand this idea.

On 07/30, Eric W. Biederman wrote:
>
> [This change requires more work to handle TASK_STOPPED and TASK_TRACED]

Yes. And it is not clear to me how can you solve this.

> [This adds a new lock ordering dependency siglock -> pi_lock -> rq_lock ]

Not really, ttwu() can be safely called with siglock held and it takes
pi_lock + rq_lock. Say, signal_wake_up().

> +int make_task_wakekill(struct task_struct *p)
> +{
> +	unsigned long flags;
> +	int cpu, success = 0;
> +	struct rq_flags rf;
> +	struct rq *rq;
> +	long state;
> +
> +	/* Assumes p != current */
> +	preempt_disable();
> +	/*
> +	 * If we are going to change a thread waiting for CONDITION we
> +	 * need to ensure that CONDITION=1 done by the caller can not be
> +	 * reordered with p->state check below. This pairs with mb() in
> +	 * set_current_state() the waiting thread does.
> +	 */
> +	raw_spin_lock_irqsave(&p->pi_lock, flags);
> +	smp_mb__after_spinlock();
> +	state = p->state;
> +
> +	/* FIXME handle TASK_STOPPED and TASK_TRACED */
> +	if ((state == TASK_KILLABLE) ||
> +	    (state == TASK_INTERRUPTIBLE)) {
> +		success = 1;
> +		cpu = task_cpu(p);
> +		rq = cpu_rq(cpu);
> +		rq_lock(rq, &rf);
> +		p->state = TASK_WAKEKILL;

You can only do this if the task was already deactivated. Just suppose it
is preempted or does something like

	set_current_sate(TASK_INTERRUPTIBLE);

	if (CONDITION) {
		// make_task_wakekill() sets state = TASK_WAKEKILL
		__set_current_state(TASK_RUNNING);
		return;
	}

	schedule();

Oleg.
Eric W. Biederman July 31, 2020, 4:50 p.m. UTC | #3
Oleg Nesterov <oleg@redhat.com> writes:

> Eric, I won't comment the intent, but I too do not understand this idea.
>
> On 07/30, Eric W. Biederman wrote:
>>
>> [This change requires more work to handle TASK_STOPPED and TASK_TRACED]
>
> Yes. And it is not clear to me how can you solve this.

I was imagining something putting TASK_STOPPED and TASK_TRACED in a loop
that verified they should be in that state before exiting so they could
handle spurious wake ups.

There are a many subtlties in that code, especially in the conversion
fo TASK_STOPPED to TASK_TRACED.  So I suspect something more would be
required but I have not looked yet to see how tricky that would be.

>> [This adds a new lock ordering dependency siglock -> pi_lock -> rq_lock ]
>
> Not really, ttwu() can be safely called with siglock held and it takes
> pi_lock + rq_lock. Say, signal_wake_up().

Good point.

>> +int make_task_wakekill(struct task_struct *p)
>> +{
>> +	unsigned long flags;
>> +	int cpu, success = 0;
>> +	struct rq_flags rf;
>> +	struct rq *rq;
>> +	long state;
>> +
>> +	/* Assumes p != current */
>> +	preempt_disable();
>> +	/*
>> +	 * If we are going to change a thread waiting for CONDITION we
>> +	 * need to ensure that CONDITION=1 done by the caller can not be
>> +	 * reordered with p->state check below. This pairs with mb() in
>> +	 * set_current_state() the waiting thread does.
>> +	 */
>> +	raw_spin_lock_irqsave(&p->pi_lock, flags);
>> +	smp_mb__after_spinlock();
>> +	state = p->state;
>> +
>> +	/* FIXME handle TASK_STOPPED and TASK_TRACED */
>> +	if ((state == TASK_KILLABLE) ||
>> +	    (state == TASK_INTERRUPTIBLE)) {
>> +		success = 1;
>> +		cpu = task_cpu(p);
>> +		rq = cpu_rq(cpu);
>> +		rq_lock(rq, &rf);
>> +		p->state = TASK_WAKEKILL;
>
> You can only do this if the task was already deactivated. Just suppose it
> is preempted or does something like
>
> 	set_current_sate(TASK_INTERRUPTIBLE);
>
> 	if (CONDITION) {
> 		// make_task_wakekill() sets state = TASK_WAKEKILL
> 		__set_current_state(TASK_RUNNING);
> 		return;
> 	}
>
> 	schedule();

You are quite right.

So that bit of code would need to be:
	if (!task->on_rq)
        	goto out;
	if ((state == TASK_KILLABLE) ||
            (state == TASK_INTERRUPTIBLE)) {
            ...

Eric
Eric W. Biederman July 31, 2020, 5:16 p.m. UTC | #4
Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Thu, Jul 30, 2020 at 4:00 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>>
>> The key is the function make_task_wakekill which could probably
>> benefit from a little more review and refinement but appears to
>> be basically correct.
>
> You really need to explain a lot more why you think this is all a good idea.
>
> For example, what if one of those other threads is waiting in line for
> a critical lock, and the wait-queue you basically disabled was the
> exclusive wait after lock handoff?
>
> That means that the lock will now effectively be held by that thread.
> No, it wasn't woken up, but it had the lock handed to it, and it's now
> entirely unresponsive until it is killed.
>
> How is that different from the deadlocks you're actually trying to fix?
>
> These are the kinds of problems that the freezer() code had too, with
> freezing things that held locks etc.
>
> This approach does seem better than the freezer thing, and if I read
> it right it will gather things in the signal handler code, but it's
> not obvious that gathering them in random places where they sleep for
> random reasons is safe or a good idea.
>
> I can imagine _so_ many dead systems if you just basically froze
> something that holds the mmap lock and is sleeping on a page fault,
> for example.
>
> Maybe I'm missing something, but I really think your "let's freeze
> things" is seriously misguided. You're concentrating on some small
> problem and trying to solve that, and not seeign the HUGE HONKING
> problems that your approach is fundamentally introducing.

Very good point.  That would be a priority inversion on mmap_lock.
Without great care that could indeed result in lockups.

That definitely requires the points where things are already sleeping
that can be converted to be opt-in.  Which potentially makes things much
more work.

Thanks, that helps kill my bright idea as I expressed it.

Part of what I was trying to solve (because I ran into the problem while
I was reading the code) was that the freezer, the cgroup v2 freezer, and
other waits do not compose nicely.

Even limited to opt-in locations I think the trick of being able to
transform the wait-state may solve that composition problem.


That said I was really just posting this so if the ideas were good they
could inspire future code, and if the ideas were bad they could be sunk.
When it comes to sorting out future especially in exec I will know which
ideas don't fly, so it will be easier to make the case for ideas that
will work.

Eric
Linus Torvalds July 31, 2020, 5:41 p.m. UTC | #5
On Fri, Jul 31, 2020 at 10:19 AM Eric W. Biederman
<ebiederm@xmission.com> wrote:
>
> Even limited to opt-in locations I think the trick of being able to
> transform the wait-state may solve that composition problem.

So the part I found intriguing was the "catch things in the signal
handling path".

Catching things there - and *only* there - would avoid a lot of the
problems we had with the freezer. When you're about to return to user
mode, there are no lock inversions etc.

And it kind of makes conceptual sense to do, since what you're trying
to capture is the signal group - so using the signal state to do so
seems like a natural thing to do. No touching of any runqueues or
scheduler data structures, do everything _purely_ with the signal
handling pathways.

So that "feels" ok to me.

That said, I do wonder if there are nasty nasty latency issues with
odd users. Normally, you'd expect that execve() with other threads in
the group shouldn't be a performance issue, because people simply
shouldn't do that. So it might be ok.

And if you capture them all in the signal handling pathway, that ends
up being a very convenient place to zap them all too, so maybe my
latency worry is misguided.

IOW, I think that you could try to do your "freese other threads" not
at all like the freezer, but more like a "collect all threads in their
signal handler parts as the first phase of zapping them".

So maybe this approach is salvageable. I see where something like the
above could work well. But I say that with a lot of handwaving, and
maybe if I see the patch I'd go "Christ, I was a complete idiot for
ever even suggesting that".

                    Linus
Eric W. Biederman July 31, 2020, 8:07 p.m. UTC | #6
Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Fri, Jul 31, 2020 at 10:19 AM Eric W. Biederman
> <ebiederm@xmission.com> wrote:
>>
>> Even limited to opt-in locations I think the trick of being able to
>> transform the wait-state may solve that composition problem.
>
> So the part I found intriguing was the "catch things in the signal
> handling path".
>
> Catching things there - and *only* there - would avoid a lot of the
> problems we had with the freezer. When you're about to return to user
> mode, there are no lock inversions etc.
>
> And it kind of makes conceptual sense to do, since what you're trying
> to capture is the signal group - so using the signal state to do so
> seems like a natural thing to do. No touching of any runqueues or
> scheduler data structures, do everything _purely_ with the signal
> handling pathways.
>
> So that "feels" ok to me.
>
> That said, I do wonder if there are nasty nasty latency issues with
> odd users. Normally, you'd expect that execve() with other threads in
> the group shouldn't be a performance issue, because people simply
> shouldn't do that. So it might be ok.
>
> And if you capture them all in the signal handling pathway, that ends
> up being a very convenient place to zap them all too, so maybe my
> latency worry is misguided.
>
> IOW, I think that you could try to do your "freese other threads" not
> at all like the freezer, but more like a "collect all threads in their
> signal handler parts as the first phase of zapping them".
>
> So maybe this approach is salvageable. I see where something like the
> above could work well. But I say that with a lot of handwaving, and
> maybe if I see the patch I'd go "Christ, I was a complete idiot for
> ever even suggesting that".

Yes.

The tricky bit is that there are a handful of stops that must
be handled, or it is impossible to stop everything without causing
disruption if the exec fails.  The big ones are TASK_STOPPED and
TASK_TRACED.  There is another in wait_for_vfork_done.

At which point I am looking at writting a wrapper around schedule that
changes task state to something like TASK_WAKEKILL when asked, and then
restores the state when released.  Something that is independent of
which freezer the code is using.

It could be the scheduler to with a special bit in state that says
opt-in.  But if we have to opt in it is probably much less error
prone to write the code as an wrapper around schedule, and only
modify the core scheduling code if necessary.

If I can make TASK_STOPPED and TASK_TRACED handle spurious wake-ups
I think I can build something that is independent of the rest of the
freezers so the code doesn't have to go 3 deep on wrappers of different
freezer at those locations.  It is already 2 layers deep.

But I really don't intend to work on that again for a while.


Right now I am in the final stages of ressurecting:
https://lore.kernel.org/linux-fsdevel/87a7ohs5ow.fsf@xmission.com/

The hard part looks like cleaning up and resurrecting Oleg's patch
to prevent the abuse of files_struct->count.
https://lore.kernel.org/linux-fsdevel/20180915160423.GA31461@redhat.com/

I am close but dotting all of the i's and crossing all of the t's is
taking ab bit.

Eric
diff mbox series

Patch

diff --git a/fs/exec.c b/fs/exec.c
index 3698252719a3..5e4b0187ac05 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1145,6 +1145,95 @@  static int exec_mmap(struct mm_struct *mm)
 	return 0;
 }
 
+static void exec_reveal_threads_locked(void)
+{
+	struct task_struct *p = current, *t;
+	struct signal_struct *signal = p->signal;
+
+	if (signal->group_execing_task) {
+		signal->group_execing_task = NULL;
+		__for_each_thread(signal, t) {
+			if (t == p)
+				continue;
+			/*
+			 * This might be a spurious wake up task t but
+			 * this should be fine as t should verify it
+			 * is the appropriate time to wake up and if
+			 * not fall back asleep.
+			 *
+			 * Any performance (rather than correctness)
+			 * implications of this code should be unimportant
+			 * as it is only called on error.
+			 */
+			wake_up_state(t, TASK_WAKEKILL);
+		}
+	}
+}
+
+static void exec_reveal_threads(void)
+{
+	spinlock_t *lock = &current->sighand->siglock;
+
+	spin_lock_irq(lock);
+	exec_reveal_threads_locked();
+	spin_unlock_irq(lock);
+}
+
+/*
+ * Conceal all other threads in the thread group from wakeups
+ */
+static int exec_conceal_threads(void)
+{
+	struct task_struct *me = current, *t;
+	struct signal_struct *signal = me->signal;
+	spinlock_t *lock = &me->sighand->siglock;
+	int ret = 0;
+
+	if (thread_group_empty(me))
+		return 0;
+
+	spin_lock_irq(lock);
+	if (signal_pending(me) || signal_group_exit(signal) ||
+	    signal->group_execing_task) {
+		spin_unlock(lock);
+		return -ERESTARTNOINTR;
+	}
+
+	signal->group_execing_task = me;
+	for (;;) {
+		unsigned int todo = 0;
+
+		__for_each_thread(signal, t) {
+			if ((t == me) || (t->flags & PF_EXITING))
+				continue;
+
+			if (make_task_wakekill(t))
+				continue;
+
+			signal_wake_up(t, 0);
+			todo++;
+		}
+
+		if ((todo == 0) || __fatal_signal_pending(me))
+			break;
+
+		set_current_state(TASK_KILLABLE);
+		spin_unlock_irq(lock);
+
+		schedule();
+
+		spin_lock_irq(lock);
+		if (__fatal_signal_pending(me))
+			break;
+	}
+	if (__fatal_signal_pending(me)) {
+		ret = -ERESTARTNOINTR;
+		exec_reveal_threads_locked();
+	}
+	spin_unlock_irq(lock);
+	return ret;
+}
+
 static int de_thread(struct task_struct *tsk)
 {
 	struct signal_struct *sig = tsk->signal;
@@ -1885,10 +1974,15 @@  static int bprm_execve(struct linux_binprm *bprm,
 	struct files_struct *displaced;
 	int retval;
 
-	retval = unshare_files(&displaced);
+	/* Conceal any other threads from wakeups */
+	retval = exec_conceal_threads();
 	if (retval)
 		return retval;
 
+	retval = unshare_files(&displaced);
+	if (retval)
+		goto out_ret;
+
 	retval = prepare_bprm_creds(bprm);
 	if (retval)
 		goto out_files;
@@ -1949,6 +2043,8 @@  static int bprm_execve(struct linux_binprm *bprm,
 out_files:
 	if (displaced)
 		reset_files_struct(displaced);
+out_ret:
+	exec_reveal_threads();
 
 	return retval;
 }
diff --git a/include/linux/sched.h b/include/linux/sched.h
index edb2020875ad..dcd79e78b651 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1726,6 +1726,8 @@  extern void kick_process(struct task_struct *tsk);
 static inline void kick_process(struct task_struct *tsk) { }
 #endif
 
+extern int make_task_wakekill(struct task_struct *tsk);
+
 extern void __set_task_comm(struct task_struct *tsk, const char *from, bool exec);
 
 static inline void set_task_comm(struct task_struct *tsk, const char *from)
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 1bad18a1d8ba..647b7d0d2231 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -106,6 +106,9 @@  struct signal_struct {
 	int			notify_count;
 	struct task_struct	*group_exit_task;
 
+	/* Task that is performing exec */
+	struct task_struct	*group_execing_task;
+
 	/* thread group stop support, overloads group_exit_code too */
 	int			group_stop_count;
 	unsigned int		flags; /* see SIGNAL_* flags below */
diff --git a/kernel/fork.c b/kernel/fork.c
index bf215af7a904..686c6901eabd 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2247,6 +2247,12 @@  static __latent_entropy struct task_struct *copy_process(
 		goto bad_fork_cancel_cgroup;
 	}
 
+	/* Don't allow creation of new tasks during exec */
+	if (unlikely(current->signal->group_execing_task)) {
+		retval = -ERESTARTNOINTR;
+		goto bad_fork_cancel_cgroup;
+	}
+
 	/* past the last point of failure */
 	if (pidfile)
 		fd_install(pidfd, pidfile);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8f360326861e..1ac8b81f22de 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2648,6 +2648,44 @@  try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	return success;
 }
 
+int make_task_wakekill(struct task_struct *p)
+{
+	unsigned long flags;
+	int cpu, success = 0;
+	struct rq_flags rf;
+	struct rq *rq;
+	long state;
+
+	/* Assumes p != current */
+	preempt_disable();
+	/*
+	 * If we are going to change a thread waiting for CONDITION we
+	 * need to ensure that CONDITION=1 done by the caller can not be
+	 * reordered with p->state check below. This pairs with mb() in
+	 * set_current_state() the waiting thread does.
+	 */
+	raw_spin_lock_irqsave(&p->pi_lock, flags);
+	smp_mb__after_spinlock();
+	state = p->state;
+
+	/* FIXME handle TASK_STOPPED and TASK_TRACED */
+	if ((state == TASK_KILLABLE) ||
+	    (state == TASK_INTERRUPTIBLE)) {
+		success = 1;
+		cpu = task_cpu(p);
+		rq = cpu_rq(cpu);
+		rq_lock(rq, &rf);
+		p->state = TASK_WAKEKILL;
+		rq_unlock(rq, &rf);
+	}
+	else if (state == TASK_WAKEKILL)
+		success = 1;
+
+	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
+	preempt_enable();
+	return success;
+}
+
 /**
  * try_invoke_on_locked_down_task - Invoke a function on task in fixed state
  * @p: Process for which the function is to be invoked.
diff --git a/kernel/signal.c b/kernel/signal.c
index 5ca48cc5da76..19f73eda1d54 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2590,6 +2590,15 @@  bool get_signal(struct ksignal *ksig)
 		goto fatal;
 	}
 
+	/* Is this task concealing itself from wake-ups during exec? */
+	if (unlikely(signal->group_execing_task)) {
+		set_current_state(TASK_WAKEKILL);
+		wake_up_process(signal->group_execing_task);
+		spin_unlock_irq(&sighand->siglock);
+		schedule();
+		goto relock;
+	}
+
 	for (;;) {
 		struct k_sigaction *ka;
 
@@ -2849,6 +2858,8 @@  void exit_signals(struct task_struct *tsk)
 	    task_participate_group_stop(tsk))
 		group_stop = CLD_STOPPED;
 out:
+	if (unlikely(tsk->signal->group_execing_task))
+		wake_up_process(tsk->signal->group_execing_task);
 	spin_unlock_irq(&tsk->sighand->siglock);
 
 	/*