Message ID: 875yn3zdag.fsf_-_@email.froward.int.ebiederm.org
State: RFC, archived
Series: [RFC] ptrace: Don't change __state
On Wed, Apr 20, 2022 at 03:54:15PM -0500, Eric W. Biederman wrote:
>
> I was thinking about this and I have an approach from a different
> direction.  In particular it removes the need for ptrace_freeze_attach
> and ptrace_unfreeze_attach to change __state.  Instead a jobctl
> bit is used to suppress waking up a process with TASK_WAKEKILL.
>
> I think this would be a good technique to completely decouple
> PREEMPT_RT from the work that ptrace_freeze_attach does.
>
> Comments?

On first read-through, I like it! A few comments down below..

> @@ -216,13 +217,11 @@ static void ptrace_unfreeze_traced(struct task_struct *task)
>  	 * PTRACE_LISTEN can allow ptrace_trap_notify to wake us up remotely.
>  	 * Recheck state under the lock to close this race.
>  	 */
> -	spin_lock_irq(&task->sighand->siglock);
> -	if (READ_ONCE(task->__state) == __TASK_TRACED) {
> -		if (__fatal_signal_pending(task))
> -			wake_up_state(task, __TASK_TRACED);
> -		else
> -			WRITE_ONCE(task->__state, TASK_TRACED);
> -	}
> +	spin_unlock_irq(&task->sighand->siglock);

	^^^^ this should be spin_lock_irq(...)

> +	WARN_ON(!(task->jobctl & JOBCTL_DELAY_WAKEKILL));
> +	task->jobctl &= ~JOBCTL_DELAY_WAKEKILL;
> +	if (fatal_signal_pending(task))
> +		wake_up_state(task, TASK_WAKEKILL);
>  	spin_unlock_irq(&task->sighand->siglock);
>  }
>
> @@ -256,7 +255,7 @@ static int ptrace_check_attach(struct task_struct *child, bool ignore_state)
>  	 */
>  	read_lock(&tasklist_lock);
>  	if (child->ptrace && child->parent == current) {
> -		WARN_ON(READ_ONCE(child->__state) == __TASK_TRACED);
> +		WARN_ON(child->jobctl & JOBCTL_DELAY_WAKEKILL);
>  		/*
>  		 * child->sighand can't be NULL, release_task()
>  		 * does ptrace_unlink() before __exit_signal().
> @@ -267,13 +266,13 @@ static int ptrace_check_attach(struct task_struct *child, bool ignore_state)
>  	read_unlock(&tasklist_lock);
>
>  	if (!ret && !ignore_state) {
> -		if (!wait_task_inactive(child, __TASK_TRACED)) {
> +		if (!wait_task_inactive(child, TASK_TRACED)) {

This is still very dubious, there are spinlocks between
set_current_state(TASK_TRACED) and schedule(), so wait_task_inactive()
can fail where we don't want it to due to TASK_TRACED being temporarily
held in ->saved_state.

>  			/*
>  			 * This can only happen if may_ptrace_stop() fails and
>  			 * ptrace_stop() changes ->state back to TASK_RUNNING,
> -			 * so we should not worry about leaking __TASK_TRACED.
> +			 * so we should not worry about leaking JOBCTL_DELAY_WAKEKILL.
>  			 */
> +			WARN_ON(!(child->jobctl & JOBCTL_DELAY_WAKEKILL));
>  			ret = -ESRCH;
>  		}
>  	}
On 04/20, Eric W. Biederman wrote:
>
> I was thinking about this and I have an approach from a different
> direction.  In particular it removes the need for ptrace_freeze_attach
> and ptrace_unfreeze_attach to change __state.  Instead a jobctl
> bit is used to suppress waking up a process with TASK_WAKEKILL.

I think this can work, but we still need something like 1/5 + 2/5?

> I think this would be a good technique to completely decouple
> PREEMPT_RT from the work that ptrace_freeze_attach does.

If CONFIG_RT=y we can't rely on the ->__state check in task_is_traced(),
and wait_task_inactive() can wrongly fail if the tracee sleeps waiting
for tasklist_lock.

A couple of comments after a quick glance,

> static void ptrace_unfreeze_traced(struct task_struct *task)
> {
> -	if (READ_ONCE(task->__state) != __TASK_TRACED)
> +	if (!task_is_traced(task))
>  		return;
>
>  	WARN_ON(!task->ptrace || task->parent != current);
> @@ -216,13 +217,11 @@ static void ptrace_unfreeze_traced(struct task_struct *task)
>  	 * PTRACE_LISTEN can allow ptrace_trap_notify to wake us up remotely.
>  	 * Recheck state under the lock to close this race.
>  	 */
> -	spin_lock_irq(&task->sighand->siglock);
> -	if (READ_ONCE(task->__state) == __TASK_TRACED) {
> -		if (__fatal_signal_pending(task))
> -			wake_up_state(task, __TASK_TRACED);
> -		else
> -			WRITE_ONCE(task->__state, TASK_TRACED);
> -	}
> +	spin_unlock_irq(&task->sighand->siglock);
> +	WARN_ON(!(task->jobctl & JOBCTL_DELAY_WAKEKILL));
> +	task->jobctl &= ~JOBCTL_DELAY_WAKEKILL;

We can't rely on the lockless task_is_traced() check above... probably
this is fine, but I need to re-check.

But at least you need to remove the comment about PTRACE_LISTEN above.

Another problem is that WARN_ON(!(task->jobctl & JOBCTL_DELAY_WAKEKILL))
doesn't look right if ignore_state in ptrace_check_attach() was true,
the tracee could stop before ptrace_unfreeze_traced().
> @@ -892,7 +891,6 @@ static int ptrace_resume(struct task_struct *child, long request,
>  	 * status and clears the code too; this can't race with the tracee, it
>  	 * takes siglock after resume.
>  	 */
> -	need_siglock = data && !thread_group_empty(current);
>  	if (need_siglock)
>  		spin_lock_irq(&child->sighand->siglock);

Hmm?

Oleg.
On Thu, Apr 21, 2022 at 09:21:38AM +0200, Peter Zijlstra wrote:
> > @@ -267,13 +266,13 @@ static int ptrace_check_attach(struct task_struct *child, bool ignore_state)
> >  	read_unlock(&tasklist_lock);
> >
> >  	if (!ret && !ignore_state) {
> > -		if (!wait_task_inactive(child, __TASK_TRACED)) {
> > +		if (!wait_task_inactive(child, TASK_TRACED)) {
>
> This is still very dubious, there are spinlocks between
> set_current_state(TASK_TRACED) and schedule(), so wait_task_inactive()
> can fail where we don't want it to due to TASK_TRACED being temporarily
> held in ->saved_state.

As such, I've taken the liberty of munging yours and Oleg's approach
together. I've yet to actually test this but it looks feasible.

---
--- a/include/linux/sched/jobctl.h
+++ b/include/linux/sched/jobctl.h
@@ -19,9 +19,11 @@ struct task_struct;
 #define JOBCTL_TRAPPING_BIT	21	/* switching to TRACED */
 #define JOBCTL_LISTENING_BIT	22	/* ptracer is listening for events */
 #define JOBCTL_TRAP_FREEZE_BIT	23	/* trap for cgroup freezer */
+#define JOBCTL_DELAY_WAKEKILL_BIT 24	/* delay killable wakeups */
 
-#define JOBCTL_STOPPED_BIT	24
-#define JOBCTL_TRACED_BIT	25
+#define JOBCTL_STOPPED_BIT	25
+#define JOBCTL_TRACED_BIT	26
+#define JOBCTL_TRACED_QUIESCE_BIT 27
 
 #define JOBCTL_STOP_DEQUEUED	(1UL << JOBCTL_STOP_DEQUEUED_BIT)
 #define JOBCTL_STOP_PENDING	(1UL << JOBCTL_STOP_PENDING_BIT)
@@ -31,9 +33,11 @@ struct task_struct;
 #define JOBCTL_TRAPPING		(1UL << JOBCTL_TRAPPING_BIT)
 #define JOBCTL_LISTENING	(1UL << JOBCTL_LISTENING_BIT)
 #define JOBCTL_TRAP_FREEZE	(1UL << JOBCTL_TRAP_FREEZE_BIT)
+#define JOBCTL_DELAY_WAKEKILL	(1UL << JOBCTL_DELAY_WAKEKILL_BIT)
 
 #define JOBCTL_STOPPED		(1UL << JOBCTL_STOPPED_BIT)
 #define JOBCTL_TRACED		(1UL << JOBCTL_TRACED_BIT)
+#define JOBCTL_TRACED_QUIESCE	(1UL << JOBCTL_TRACED_QUIESCE_BIT)
 
 #define JOBCTL_TRAP_MASK	(JOBCTL_TRAP_STOP | JOBCTL_TRAP_NOTIFY)
 #define JOBCTL_PENDING_MASK	(JOBCTL_STOP_PENDING | JOBCTL_TRAP_MASK)
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -193,26 +193,32 @@ static bool looks_like_a_spurious_pid(st
  */
 static bool ptrace_freeze_traced(struct task_struct *task)
 {
+	unsigned long flags;
 	bool ret = false;
 
 	/* Lockless, nobody but us can set this flag */
 	if (task->jobctl & JOBCTL_LISTENING)
 		return ret;
 
-	spin_lock_irq(&task->sighand->siglock);
-	if (task_is_traced(task) && !looks_like_a_spurious_pid(task) &&
+	if (!lock_task_sighand(task, &flags))
+		return ret;
+
+	if (READ_ONCE(task->__state) == TASK_TRACED &&
+	    !looks_like_a_spurious_pid(task) &&
 	    !__fatal_signal_pending(task)) {
-		WRITE_ONCE(task->__state, __TASK_TRACED);
+		WARN_ON_ONCE(!task_is_traced(task));
+		WARN_ON_ONCE(task->jobctl & JOBCTL_DELAY_WAKEKILL);
+		task->jobctl |= JOBCTL_DELAY_WAKEKILL;
 		ret = true;
 	}
-	spin_unlock_irq(&task->sighand->siglock);
+	unlock_task_sighand(task, &flags);
 
 	return ret;
 }
 
 static void ptrace_unfreeze_traced(struct task_struct *task)
 {
-	if (READ_ONCE(task->__state) != __TASK_TRACED)
+	if (!task_is_traced(task))
 		return;
 
 	WARN_ON(!task->ptrace || task->parent != current);
@@ -222,12 +228,11 @@ static void ptrace_unfreeze_traced(struc
 	 * Recheck state under the lock to close this race.
 	 */
 	spin_lock_irq(&task->sighand->siglock);
-	if (READ_ONCE(task->__state) == __TASK_TRACED) {
-		if (__fatal_signal_pending(task)) {
-			task->jobctl &= ~JOBCTL_TRACED;
-			wake_up_state(task, __TASK_TRACED);
-		} else
-			WRITE_ONCE(task->__state, TASK_TRACED);
+//	WARN_ON_ONCE(!(task->jobctl & JOBCTL_DELAY_WAKEKILL));
+	task->jobctl &= ~JOBCTL_DELAY_WAKEKILL;
+	if (__fatal_signal_pending(task)) {
+		task->jobctl &= ~JOBCTL_TRACED;
+		wake_up_state(task, TASK_WAKEKILL);
 	}
 	spin_unlock_irq(&task->sighand->siglock);
 }
@@ -251,40 +256,46 @@ static void ptrace_unfreeze_traced(struc
  */
 static int ptrace_check_attach(struct task_struct *child, bool ignore_state)
 {
-	int ret = -ESRCH;
+	int traced;
 
 	/*
 	 * We take the read lock around doing both checks to close a
-	 * possible race where someone else was tracing our child and
-	 * detached between these two checks. After this locked check,
-	 * we are sure that this is our traced child and that can only
-	 * be changed by us so it's not changing right after this.
+	 * possible race where someone else attaches or detaches our
+	 * natural child.
 	 */
 	read_lock(&tasklist_lock);
-	if (child->ptrace && child->parent == current) {
-		WARN_ON(READ_ONCE(child->__state) == __TASK_TRACED);
-		/*
-		 * child->sighand can't be NULL, release_task()
-		 * does ptrace_unlink() before __exit_signal().
-		 */
-		if (ignore_state || ptrace_freeze_traced(child))
-			ret = 0;
-	}
+	traced = child->ptrace && child->parent == current;
 	read_unlock(&tasklist_lock);
+	if (!traced)
+		return -ESRCH;
 
-	if (!ret && !ignore_state) {
-		if (!wait_task_inactive(child, __TASK_TRACED)) {
-			/*
-			 * This can only happen if may_ptrace_stop() fails and
-			 * ptrace_stop() changes ->state back to TASK_RUNNING,
-			 * so we should not worry about leaking __TASK_TRACED.
-			 */
-			WARN_ON(READ_ONCE(child->__state) == __TASK_TRACED);
-			ret = -ESRCH;
-		}
+	WARN_ON_ONCE(READ_ONCE(child->__state) == __TASK_TRACED);
+	WARN_ON_ONCE(READ_ONCE(child->jobctl) & JOBCTL_DELAY_WAKEKILL);
+
+	if (ignore_state)
+		return 0;
+
+	if (!task_is_traced(child))
+		return -ESRCH;
+
+	/* wait for JOBCTL_TRACED_QUIESCE to go away, see ptrace_stop() */
+	for (;;) {
+		if (fatal_signal_pending(current))
+			return -EINTR;
+
+		set_current_state(TASK_KILLABLE);
+		if (!(READ_ONCE(child->jobctl) & JOBCTL_TRACED_QUIESCE))
+			break;
+
+		schedule();
 	}
+	__set_current_state(TASK_RUNNING);
 
-	return ret;
+	if (!wait_task_inactive(child, TASK_TRACED) ||
+	    !ptrace_freeze_traced(child))
+		return -ESRCH;
+
+	return 0;
 }
 
 static bool ptrace_has_cap(struct user_namespace *ns, unsigned int mode)
@@ -1329,8 +1340,7 @@ SYSCALL_DEFINE4(ptrace, long, request, l
 		goto out_put_task_struct;
 
 	ret = arch_ptrace(child, request, addr, data);
-	if (ret || request != PTRACE_DETACH)
-		ptrace_unfreeze_traced(child);
+	ptrace_unfreeze_traced(child);
 
 out_put_task_struct:
 	put_task_struct(child);
@@ -1472,8 +1482,7 @@ COMPAT_SYSCALL_DEFINE4(ptrace, compat_lo
 				  request == PTRACE_INTERRUPT);
 	if (!ret) {
 		ret = compat_arch_ptrace(child, request, addr, data);
-		if (ret || request != PTRACE_DETACH)
-			ptrace_unfreeze_traced(child);
+		ptrace_unfreeze_traced(child);
 	}
 
 out_put_task_struct:
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -764,6 +764,10 @@ void signal_wake_up_state(struct task_st
 {
 	lockdep_assert_held(&t->sighand->siglock);
 
+	/* Suppress wakekill? */
+	if (t->jobctl & JOBCTL_DELAY_WAKEKILL)
+		state &= ~TASK_WAKEKILL;
+
 	set_tsk_thread_flag(t, TIF_SIGPENDING);
 
 	/*
@@ -774,7 +778,7 @@ void signal_wake_up_state(struct task_st
 	 * handle its death signal.
 	 */
 	if (wake_up_state(t, state | TASK_INTERRUPTIBLE))
-		t->jobctl &= ~(JOBCTL_STOPPED | JOBCTL_TRACED);
+		t->jobctl &= ~(JOBCTL_STOPPED | JOBCTL_TRACED | JOBCTL_TRACED_QUIESCE);
 	else
 		kick_process(t);
 }
@@ -2187,6 +2191,15 @@ static void do_notify_parent_cldstop(str
 	spin_unlock_irqrestore(&sighand->siglock, flags);
 }
 
+static void clear_traced_quiesce(void)
+{
+	spin_lock_irq(&current->sighand->siglock);
+	WARN_ON_ONCE(!(current->jobctl & JOBCTL_TRACED_QUIESCE));
+	current->jobctl &= ~JOBCTL_TRACED_QUIESCE;
+	wake_up_state(current->parent, TASK_KILLABLE);
+	spin_unlock_irq(&current->sighand->siglock);
+}
+
 /*
  * This must be called with current->sighand->siglock held.
  *
@@ -2225,7 +2238,7 @@ static int ptrace_stop(int exit_code, in
 	 * schedule() will not sleep if there is a pending signal that
 	 * can awaken the task.
 	 */
-	current->jobctl |= JOBCTL_TRACED;
+	current->jobctl |= JOBCTL_TRACED | JOBCTL_TRACED_QUIESCE;
 	set_special_state(TASK_TRACED);
 
 	/*
@@ -2290,14 +2303,26 @@ static int ptrace_stop(int exit_code, in
 		/*
 		 * Don't want to allow preemption here, because
 		 * sys_ptrace() needs this task to be inactive.
-		 *
-		 * XXX: implement read_unlock_no_resched().
 		 */
 		preempt_disable();
 		read_unlock(&tasklist_lock);
-		cgroup_enter_frozen();
+		cgroup_enter_frozen(); // XXX broken on PREEMPT_RT !!!
+
+		/*
+		 * JOBCTL_TRACE_QUIESCE bridges the gap between
+		 * set_current_state(TASK_TRACED) above and schedule() below.
+		 * There must not be any blocking (specifically anything that
+		 * touched ->saved_state on PREEMPT_RT) between here and
+		 * schedule().
+		 *
+		 * ptrace_check_attach() relies on this with its
+		 * wait_task_inactive() usage.
		 */
+		clear_traced_quiesce();
+
 		preempt_enable_no_resched();
 		freezable_schedule();
+		cgroup_leave_frozen(true);
 
 	} else {
 		/*
@@ -2335,6 +2360,7 @@ static int ptrace_stop(int exit_code, in
 
 	/* LISTENING can be set only during STOP traps, clear it */
 	current->jobctl &= ~JOBCTL_LISTENING;
+	current->jobctl &= ~JOBCTL_DELAY_WAKEKILL;
 
 	/*
 	 * Queued signals ignored us while we were stopped for tracing.
On 04/21, Peter Zijlstra wrote:
>
> As such, I've taken the liberty of munging yours and Oleg's approach
> together. I've yet to actually test this but it looks feasible.

Agreed, but I am a bit confused. Is it on top of 1/5? Doesn't look so,
and I can't apply it with or without 1/5. So I am not sure I understand
what exactly it does.

Oleg.
On Thu, Apr 21, 2022 at 12:49:43PM +0200, Oleg Nesterov wrote:
> On 04/21, Peter Zijlstra wrote:
> >
> > As such, I've taken the liberty of munging yours and Oleg's approach
> > together. I've yet to actually test this but it looks feasible.
>
> Agreed, but I am a bit confused. Is it on top of 1/5? Doesn't look so,
> and I can't apply it with or without 1/5. So I am not sure I understand
> what exactly it does.

Yes, it is on top of a modified 1, I mean to post an updated and tested
series later today.

Basically it does the TRACED_QUIESCE thing from your patch, and then
does the DELAY_WAKEKILL thing from Eric's patch.
Peter Zijlstra <peterz@infradead.org> writes:

> On Wed, Apr 20, 2022 at 03:54:15PM -0500, Eric W. Biederman wrote:
>>
>> I was thinking about this and I have an approach from a different
>> direction.  In particular it removes the need for ptrace_freeze_attach
>> and ptrace_unfreeze_attach to change __state.  Instead a jobctl
>> bit is used to suppress waking up a process with TASK_WAKEKILL.
>>
>> I think this would be a good technique to completely decouple
>> PREEMPT_RT from the work that ptrace_freeze_attach does.
>>
>> Comments?
>
> On first read-through, I like it! A few comments down below..
>
>> @@ -216,13 +217,11 @@ static void ptrace_unfreeze_traced(struct task_struct *task)
>>  	 * PTRACE_LISTEN can allow ptrace_trap_notify to wake us up remotely.
>>  	 * Recheck state under the lock to close this race.
>>  	 */
>> -	spin_lock_irq(&task->sighand->siglock);
>> -	if (READ_ONCE(task->__state) == __TASK_TRACED) {
>> -		if (__fatal_signal_pending(task))
>> -			wake_up_state(task, __TASK_TRACED);
>> -		else
>> -			WRITE_ONCE(task->__state, TASK_TRACED);
>> -	}
>> +	spin_unlock_irq(&task->sighand->siglock);
>
> 	^^^^ this should be spin_lock_irq(...)

Doh!  Thank you for spotting that.  That solves my nasty splat in
__send_signal.

>
>> +	WARN_ON(!(task->jobctl & JOBCTL_DELAY_WAKEKILL));
>> +	task->jobctl &= ~JOBCTL_DELAY_WAKEKILL;
>> +	if (fatal_signal_pending(task))
>> +		wake_up_state(task, TASK_WAKEKILL);
>>  	spin_unlock_irq(&task->sighand->siglock);
>>  }
>>
>> @@ -256,7 +255,7 @@ static int ptrace_check_attach(struct task_struct *child, bool ignore_state)
>>  	 */
>>  	read_lock(&tasklist_lock);
>>  	if (child->ptrace && child->parent == current) {
>> -		WARN_ON(READ_ONCE(child->__state) == __TASK_TRACED);
>> +		WARN_ON(child->jobctl & JOBCTL_DELAY_WAKEKILL);
>>  		/*
>>  		 * child->sighand can't be NULL, release_task()
>>  		 * does ptrace_unlink() before __exit_signal().
>> @@ -267,13 +266,13 @@ static int ptrace_check_attach(struct task_struct *child, bool ignore_state)
>>  	read_unlock(&tasklist_lock);
>>
>>  	if (!ret && !ignore_state) {
>> -		if (!wait_task_inactive(child, __TASK_TRACED)) {
>> +		if (!wait_task_inactive(child, TASK_TRACED)) {
>
> This is still very dubious, there are spinlocks between
> set_current_state(TASK_TRACED) and schedule(), so wait_task_inactive()
> can fail where we don't want it to due to TASK_TRACED being temporarily
> held in ->saved_state.

When it comes to PREEMPT_RT yes.

I think we might be able to remove the wait_task_inactive, I am not
certain what it gets us.  All this change gets us is the removal of
playing with __state.

Eric
Oleg Nesterov <oleg@redhat.com> writes:

> On 04/20, Eric W. Biederman wrote:
>> @@ -892,7 +891,6 @@ static int ptrace_resume(struct task_struct *child, long request,
>>  	 * status and clears the code too; this can't race with the tracee, it
>>  	 * takes siglock after resume.
>>  	 */
>> -	need_siglock = data && !thread_group_empty(current);
>>  	if (need_siglock)
>>  		spin_lock_irq(&child->sighand->siglock);
>
> Hmm?

A half backed-out change in ptrace_resume (I thought ptrace_resume would
need to clear JOBCTL_DELAY_WAKEKILL).  I somehow failed to restore the
need_siglock line.

Eric
diff --git a/include/linux/sched/jobctl.h b/include/linux/sched/jobctl.h
index fa067de9f1a9..4e154ad8205f 100644
--- a/include/linux/sched/jobctl.h
+++ b/include/linux/sched/jobctl.h
@@ -19,6 +19,7 @@ struct task_struct;
 #define JOBCTL_TRAPPING_BIT	21	/* switching to TRACED */
 #define JOBCTL_LISTENING_BIT	22	/* ptracer is listening for events */
 #define JOBCTL_TRAP_FREEZE_BIT	23	/* trap for cgroup freezer */
+#define JOBCTL_DELAY_WAKEKILL_BIT 24	/* delay killable wakeups */
 
 #define JOBCTL_STOP_DEQUEUED	(1UL << JOBCTL_STOP_DEQUEUED_BIT)
 #define JOBCTL_STOP_PENDING	(1UL << JOBCTL_STOP_PENDING_BIT)
@@ -28,6 +29,7 @@ struct task_struct;
 #define JOBCTL_TRAPPING		(1UL << JOBCTL_TRAPPING_BIT)
 #define JOBCTL_LISTENING	(1UL << JOBCTL_LISTENING_BIT)
 #define JOBCTL_TRAP_FREEZE	(1UL << JOBCTL_TRAP_FREEZE_BIT)
+#define JOBCTL_DELAY_WAKEKILL	(1UL << JOBCTL_DELAY_WAKEKILL_BIT)
 
 #define JOBCTL_TRAP_MASK	(JOBCTL_TRAP_STOP | JOBCTL_TRAP_NOTIFY)
 #define JOBCTL_PENDING_MASK	(JOBCTL_STOP_PENDING | JOBCTL_TRAP_MASK)
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 4f78d0d5940c..a55320a0e8e3 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -197,7 +197,8 @@ static bool ptrace_freeze_traced(struct task_struct *task)
 	spin_lock_irq(&task->sighand->siglock);
 	if (task_is_traced(task) && !looks_like_a_spurious_pid(task) &&
 	    !__fatal_signal_pending(task)) {
-		WRITE_ONCE(task->__state, __TASK_TRACED);
+		//WARN_ON((task->jobctl & JOBCTL_DELAY_WAKEKILL));
+		task->jobctl |= JOBCTL_DELAY_WAKEKILL;
 		ret = true;
 	}
 	spin_unlock_irq(&task->sighand->siglock);
@@ -207,7 +208,7 @@ static bool ptrace_freeze_traced(struct task_struct *task)
 
 static void ptrace_unfreeze_traced(struct task_struct *task)
 {
-	if (READ_ONCE(task->__state) != __TASK_TRACED)
+	if (!task_is_traced(task))
 		return;
 
 	WARN_ON(!task->ptrace || task->parent != current);
@@ -216,13 +217,11 @@ static void ptrace_unfreeze_traced(struct task_struct *task)
 	 * PTRACE_LISTEN can allow ptrace_trap_notify to wake us up remotely.
 	 * Recheck state under the lock to close this race.
 	 */
-	spin_lock_irq(&task->sighand->siglock);
-	if (READ_ONCE(task->__state) == __TASK_TRACED) {
-		if (__fatal_signal_pending(task))
-			wake_up_state(task, __TASK_TRACED);
-		else
-			WRITE_ONCE(task->__state, TASK_TRACED);
-	}
+	spin_unlock_irq(&task->sighand->siglock);
+	WARN_ON(!(task->jobctl & JOBCTL_DELAY_WAKEKILL));
+	task->jobctl &= ~JOBCTL_DELAY_WAKEKILL;
+	if (fatal_signal_pending(task))
+		wake_up_state(task, TASK_WAKEKILL);
 	spin_unlock_irq(&task->sighand->siglock);
 }
 
@@ -256,7 +255,7 @@ static int ptrace_check_attach(struct task_struct *child, bool ignore_state)
 	 */
 	read_lock(&tasklist_lock);
 	if (child->ptrace && child->parent == current) {
-		WARN_ON(READ_ONCE(child->__state) == __TASK_TRACED);
+		WARN_ON(child->jobctl & JOBCTL_DELAY_WAKEKILL);
 		/*
 		 * child->sighand can't be NULL, release_task()
 		 * does ptrace_unlink() before __exit_signal().
@@ -267,13 +266,13 @@ static int ptrace_check_attach(struct task_struct *child, bool ignore_state)
 	read_unlock(&tasklist_lock);
 
 	if (!ret && !ignore_state) {
-		if (!wait_task_inactive(child, __TASK_TRACED)) {
+		if (!wait_task_inactive(child, TASK_TRACED)) {
 			/*
 			 * This can only happen if may_ptrace_stop() fails and
 			 * ptrace_stop() changes ->state back to TASK_RUNNING,
-			 * so we should not worry about leaking __TASK_TRACED.
+			 * so we should not worry about leaking JOBCTL_DELAY_WAKEKILL.
 			 */
-			WARN_ON(READ_ONCE(child->__state) == __TASK_TRACED);
+			WARN_ON(!(child->jobctl & JOBCTL_DELAY_WAKEKILL));
 			ret = -ESRCH;
 		}
 	}
@@ -892,7 +891,6 @@ static int ptrace_resume(struct task_struct *child, long request,
 	 * status and clears the code too; this can't race with the tracee, it
 	 * takes siglock after resume.
 	 */
-	need_siglock = data && !thread_group_empty(current);
 	if (need_siglock)
 		spin_lock_irq(&child->sighand->siglock);
 	child->exit_code = data;
@@ -1325,8 +1323,7 @@ SYSCALL_DEFINE4(ptrace, long, request, long, pid, unsigned long, addr,
 		goto out_put_task_struct;
 
 	ret = arch_ptrace(child, request, addr, data);
-	if (ret || request != PTRACE_DETACH)
-		ptrace_unfreeze_traced(child);
+	ptrace_unfreeze_traced(child);
 
 out_put_task_struct:
 	put_task_struct(child);
@@ -1468,8 +1465,7 @@ COMPAT_SYSCALL_DEFINE4(ptrace, compat_long_t, request, compat_long_t, pid,
 				  request == PTRACE_INTERRUPT);
 	if (!ret) {
 		ret = compat_arch_ptrace(child, request, addr, data);
-		if (ret || request != PTRACE_DETACH)
-			ptrace_unfreeze_traced(child);
+		ptrace_unfreeze_traced(child);
 	}
 
 out_put_task_struct:
diff --git a/kernel/signal.c b/kernel/signal.c
index 30cd1ca43bcd..a1277580c92e 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -762,6 +762,10 @@ static int dequeue_synchronous_signal(kernel_siginfo_t *info)
  */
 void signal_wake_up_state(struct task_struct *t, unsigned int state)
 {
+	/* Suppress wakekill? */
+	if (t->jobctl & JOBCTL_DELAY_WAKEKILL)
+		state &= ~TASK_WAKEKILL;
+
 	set_tsk_thread_flag(t, TIF_SIGPENDING);
 	/*
 	 * TASK_WAKEKILL also means wake it up in the stopped/traced/killable
@@ -2328,6 +2332,7 @@ static int ptrace_stop(int exit_code, int why, int clear_code,
 
 	/* LISTENING can be set only during STOP traps, clear it */
 	current->jobctl &= ~JOBCTL_LISTENING;
+	current->jobctl &= ~JOBCTL_DELAY_WAKEKILL;
 
 	/*
 	 * Queued signals ignored us while we were stopped for tracing.
I was thinking about this and I have an approach from a different
direction.  In particular it removes the need for ptrace_freeze_attach
and ptrace_unfreeze_attach to change __state.  Instead a jobctl
bit is used to suppress waking up a process with TASK_WAKEKILL.

I think this would be a good technique to completely decouple
PREEMPT_RT from the work that ptrace_freeze_attach does.

Comments?

Eric

---

ptrace_freeze_traced and ptrace_unfreeze_traced have been playing games
with __state, removing TASK_WAKEKILL from __state while a ptrace command
is executing.

Instead add JOBCTL_DELAY_WAKEKILL and set it in ptrace_freeze_traced,
and clear it in ptrace_unfreeze_traced and in the final exit of
ptrace_stop.

Then in signal_wake_up_state drop TASK_WAKEKILL if it is sent while
JOBCTL_DELAY_WAKEKILL is set.  As signal_wake_up is the only function
that sets TASK_WAKEKILL when waking a process, this achieves the same
effect with less messing with state.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---

As a proof of concept this seems to work most of the time, but
unfortunately I have a bug somewhere (a memory stomp?) and I am hitting
a BUG_ON that asserts __send_signal is not holding the siglock.

[   97.000168] ------------[ cut here ]------------
[   97.001782] kernel BUG at kernel/signal.c:1086!
[   97.002539] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
[   97.002846] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.18.0-rc1userns+ #449
[   97.003135] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
[   97.003480] RIP: 0010:__send_signal+0x256/0x580
[   97.003812] Code: f0 ff ff 49 89 c7 48 85 c0 0f 85 a8 fe ff ff 41 bf 04 00 00 00 44 8d 45 ff e9 f8 fe ff ff 45 31 ff 40
[   97.003812] RSP: 0018:ffffc900000c8e88 EFLAGS: 00000046
[   97.003812] RAX: 0000000000000000 RBX: ffff88800875df00 RCX: 0000000000000001
[   97.003812] RDX: ffff88800875df00 RSI: 0000000000000001 RDI: 000000000000000e
[   97.003812] RBP: 000000000000000e R08: 0000000000000001 R09: 0000000000000000
[   97.003812] R10: 0000000000000000 R11: ffffc900000c8ff8 R12: 0000000000000000
[   97.003812] R13: 0000000000000001 R14: 0000000000000001 R15: ffffffff8115f3f3
[   97.003812] FS:  0000000000000000(0000) GS:ffff8880bcb00000(0000) knlGS:0000000000000000
[   97.003812] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   97.003812] CR2: 00007ffff739c830 CR3: 00000000094d6000 CR4: 00000000000006e0
[   97.003812] Call Trace:
[   97.003812]  <IRQ>
[   97.003812]  group_send_sig_info+0xe1/0x190
[   97.003812]  kill_pid_info+0x78/0x150
[   97.003812]  it_real_fn+0x35/0xe0
[   97.003812]  ? __x64_sys_setitimer+0x150/0x150
[   97.003812]  __hrtimer_run_queues+0x1ca/0x3f0
[   97.003812]  hrtimer_interrupt+0x106/0x220
[   97.003812]  __sysvec_apic_timer_interrupt+0x96/0x260
[   97.003812]  sysvec_apic_timer_interrupt+0x9d/0xd0
[   97.003812]  </IRQ>
[   97.003812]  <TASK>
[   97.003812]  asm_sysvec_apic_timer_interrupt+0x12/0x20
[   97.003812] RIP: 0010:default_idle+0x10/0x20
[   97.003812] Code: 8b 04 25 00 b0 01 00 f0 80 60 02 df c3 0f ae f0 0f ae 38 0f ae f0 eb b9 66 90 0f 1f 44 00 00 eb 07 03
[   97.003812] RSP: 0018:ffffc90000093ef8 EFLAGS: 00000246
[   97.003812] RAX: ffffffff823fafb0 RBX: ffff8880056f97c0 RCX: 0000000000000000
[   97.003812] RDX: ffffffff81193e1d RSI: ffffffff82da5301 RDI: ffffffff823fb111
[   97.003812] RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000000
[   97.003812] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000
[   97.003812] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[   97.003812]  ? mwait_idle+0x70/0x70
[   97.003812]  ? nohz_balance_enter_idle+0x6d/0x220
[   97.003812]  ? default_idle_call+0x21/0x90
[   97.003812]  ? mwait_idle+0x70/0x70
[   97.003812]  default_idle_call+0x59/0x90
[   97.003812]  do_idle+0x1f3/0x260
[   97.003812]  ? finish_task_switch.isra.0+0xef/0x350
[   97.003812]  cpu_startup_entry+0x19/0x20
[   97.003812]  secondary_startup_64_no_verify+0xc3/0xcb
[   97.003812]  </TASK>
[   97.003812] Modules linked in:
[   97.003812] ---[ end trace 0000000000000000 ]---
[   97.003812] RIP: 0010:__send_signal+0x256/0x580
[   97.003812] Code: f0 ff ff 49 89 c7 48 85 c0 0f 85 a8 fe ff ff 41 bf 04 00 00 00 44 8d 45 ff e9 f8 fe ff ff 45 31 ff 40
[   97.003812] RSP: 0018:ffffc900000c8e88 EFLAGS: 00000046
[   97.003812] RAX: 0000000000000000 RBX: ffff88800875df00 RCX: 0000000000000001
[   97.003812] RDX: ffff88800875df00 RSI: 0000000000000001 RDI: 000000000000000e
[   97.003812] RBP: 000000000000000e R08: 0000000000000001 R09: 0000000000000000
[   97.003812] R10: 0000000000000000 R11: ffffc900000c8ff8 R12: 0000000000000000
[   97.003812] R13: 0000000000000001 R14: 0000000000000001 R15: ffffffff8115f3f3
[   97.003812] FS:  0000000000000000(0000) GS:ffff8880bcb00000(0000) knlGS:0000000000000000
[   97.003812] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   97.003812] CR2: 00007ffff739c830 CR3: 00000000094d6000 CR4: 00000000000006e0
[   97.003812] Kernel panic - not syncing: Fatal exception in interrupt
[   97.003812] Kernel Offset: disabled
[   97.003812] ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]---

---
 include/linux/sched/jobctl.h |  2 ++
 kernel/ptrace.c              | 32 ++++++++++++++------------------
 kernel/signal.c              |  5 +++++
 3 files changed, 21 insertions(+), 18 deletions(-)