Message ID: 20220421150654.817117821@infradead.org (mailing list archive)
State: Handled Elsewhere, archived
Series: ptrace-vs-PREEMPT_RT and freezer rewrite
On 04/21, Peter Zijlstra wrote:
>
> Rework ptrace_check_attach() / ptrace_unfreeze_traced() to not rely on
> task->__state as much.

Looks good after the quick glance... but to be honest I got lost and
I need to apply these patches and read the code carefully.

However, I am not able to do this until Monday, sorry.

Just one nit for now,

>  static void ptrace_unfreeze_traced(struct task_struct *task)
>  {
> -	if (READ_ONCE(task->__state) != __TASK_TRACED)
> +	if (!task_is_traced(task))
>  		return;
>
>  	WARN_ON(!task->ptrace || task->parent != current);
>
> -	/*
> -	 * PTRACE_LISTEN can allow ptrace_trap_notify to wake us up remotely.
> -	 * Recheck state under the lock to close this race.
> -	 */
>  	spin_lock_irq(&task->sighand->siglock);
> -	if (READ_ONCE(task->__state) == __TASK_TRACED) {
> +	if (task_is_traced(task)) {

I think ptrace_unfreeze_traced() should not use task_is_traced() at all.
I think a single lockless

	if (!(task->jobctl & JOBCTL_DELAY_WAKEKILL))
		return;

at the start should be enough? Nobody else can set this flag. It can be
cleared by the tracee if it was woken up, so perhaps we can check it
again, but afaics this is not strictly needed.

> +//	WARN_ON_ONCE(!(task->jobctl & JOBCTL_DELAY_WAKEKILL));

Did you really want to add the commented WARN_ON_ONCE?

Oleg.
Peter Zijlstra <peterz@infradead.org> writes:

> Rework ptrace_check_attach() / ptrace_unfreeze_traced() to not rely on
> task->__state as much.
>
> Due to how PREEMPT_RT is changing the rules vs task->__state with the
> introduction of task->saved_state while TASK_RTLOCK_WAIT (the whole
> blocking spinlock thing), the way ptrace freeze tries to do things no
> longer works.

The problem with ptrace_stop and do_signal_stop that requires dropping
siglock and grabbing tasklist_lock is that do_notify_parent_cldstop
needs tasklist_lock to keep parent and real_parent stable.

With just some very modest code changes it looks like we can use a
process's own siglock to keep parent and real_parent stable. The
siglock is already acquired in all of those places; it is just not held
over the changing of parent and real_parent.

Then make a rule that a child's siglock must be grabbed before a
parent's siglock, and do_notify_parent_cldstop can always be called
under the child's siglock.

This means ptrace_stop can be significantly simplified, and the
notifications can be moved far enough up that set_special_state can be
called after do_notify_parent_cldstop. The result is that there is
simply no PREEMPT_RT issue to worry about and wait_task_inactive can be
used as is.

I remember Oleg suggesting a change something like this a long time ago.

I need to handle the case where the parent and the child share the same
sighand, but that is just remembering to handle it in
do_notify_parent_cldstop, as the handling is simply not taking the lock
twice.

I am going to play with that and see if there are any gotchas I missed
when looking through the code.

Eric
On Thu, Apr 21, 2022 at 08:23:26PM +0200, Oleg Nesterov wrote:
> On 04/21, Peter Zijlstra wrote:
> >
> > Rework ptrace_check_attach() / ptrace_unfreeze_traced() to not rely on
> > task->__state as much.
>
> Looks good after the quick glance... but to be honest I got lost and
> I need to apply these patches and read the code carefully.
>
> However, I am not able to do this until Monday, sorry.

Sure, no worries. Take your time.

> Just one nit for now,
>
> >  static void ptrace_unfreeze_traced(struct task_struct *task)
> >  {
> > -	if (READ_ONCE(task->__state) != __TASK_TRACED)
> > +	if (!task_is_traced(task))
> >  		return;
> >
> >  	WARN_ON(!task->ptrace || task->parent != current);
> >
> > -	/*
> > -	 * PTRACE_LISTEN can allow ptrace_trap_notify to wake us up remotely.
> > -	 * Recheck state under the lock to close this race.
> > -	 */
> >  	spin_lock_irq(&task->sighand->siglock);
> > -	if (READ_ONCE(task->__state) == __TASK_TRACED) {
> > +	if (task_is_traced(task)) {
>
> I think ptrace_unfreeze_traced() should not use task_is_traced() at all.
> I think a single lockless
>
> 	if (!(task->jobctl & JOBCTL_DELAY_WAKEKILL))
> 		return;
>
> at the start should be enough?

I think so. That is indeed cleaner. I'll make the change if I don't see
anything wrong with it in the morning when the brain has woken up again ;-)

> Nobody else can set this flag. It can be cleared by the tracee if it was
> woken up, so perhaps we can check it again but afaics this is not strictly
> needed.
>
> > +//	WARN_ON_ONCE(!(task->jobctl & JOBCTL_DELAY_WAKEKILL));
>
> Did you really want to add the commented WARN_ON_ONCE?

I did that because:

@@ -1472,8 +1479,7 @@ COMPAT_SYSCALL_DEFINE4(ptrace, compat_lo
 				  request == PTRACE_INTERRUPT);
 	if (!ret) {
 		ret = compat_arch_ptrace(child, request, addr, data);
-		if (ret || request != PTRACE_DETACH)
-			ptrace_unfreeze_traced(child);
+		ptrace_unfreeze_traced(child);
 	}

can now call unfreeze too often.

I left the comment in because I need to think more about why Eric did
that and see if it really is needed.
On 04/21, Peter Zijlstra wrote:
>
> +static void clear_traced_quiesce(void)
> +{
> +	spin_lock_irq(&current->sighand->siglock);
> +	WARN_ON_ONCE(!(current->jobctl & JOBCTL_TRACED_QUIESCE));

This WARN_ON_ONCE() doesn't look right, the task can be killed right
after ptrace_stop() sets JOBCTL_TRACED | JOBCTL_TRACED_QUIESCE and
drops siglock.

> @@ -2290,14 +2303,26 @@ static int ptrace_stop(int exit_code, in
> 		/*
> 		 * Don't want to allow preemption here, because
> 		 * sys_ptrace() needs this task to be inactive.
> -		 *
> -		 * XXX: implement read_unlock_no_resched().
> 		 */
> 		preempt_disable();
> 		read_unlock(&tasklist_lock);
> -		cgroup_enter_frozen();
> +		cgroup_enter_frozen(); // XXX broken on PREEMPT_RT !!!
> +
> +		/*
> +		 * JOBCTL_TRACE_QUIESCE bridges the gap between
> +		 * set_current_state(TASK_TRACED) above and schedule() below.
> +		 * There must not be any blocking (specifically anything that
> +		 * touched ->saved_state on PREEMPT_RT) between here and
> +		 * schedule().
> +		 *
> +		 * ptrace_check_attach() relies on this with its
> +		 * wait_task_inactive() usage.
> +		 */
> +		clear_traced_quiesce();

Well, I think it should be called earlier under tasklist_lock,
before preempt_disable() above.

We need tasklist_lock to protect ->parent, the debugger can be killed
and go away right after read_unlock(&tasklist_lock).

Still trying to convince myself everything is right with
JOBCTL_STOPPED/TRACED ...

Oleg.
On 04/21, Peter Zijlstra wrote:
>
> @@ -2225,7 +2238,7 @@ static int ptrace_stop(int exit_code, in
> 	 * schedule() will not sleep if there is a pending signal that
> 	 * can awaken the task.
> 	 */
> -	current->jobctl |= JOBCTL_TRACED;
> +	current->jobctl |= JOBCTL_TRACED | JOBCTL_TRACED_QUIESCE;
> 	set_special_state(TASK_TRACED);

OK, this looks wrong. I actually mean the previous patch which sets
JOBCTL_TRACED.

The problem is that the tracee can be already killed, so that
fatal_signal_pending(current) is true. In this case we can't rely on
signal_wake_up_state() which should clear JOBCTL_TRACED, or the
callers of ptrace_signal_wake_up/etc which clear this flag by hand.

In this case schedule() won't block and ptrace_stop() will leak
JOBCTL_TRACED. Unless I missed something.

We could check fatal_signal_pending() and damn! this is what I think
ptrace_stop() should have done from the very beginning. But for now
I'd suggest to simply clear this flag before return, along with
DELAY_WAKEKILL and LISTENING.

> 	current->jobctl &= ~JOBCTL_LISTENING;
> +	current->jobctl &= ~JOBCTL_DELAY_WAKEKILL;

	current->jobctl &=
		~(~JOBCTL_TRACED | JOBCTL_DELAY_WAKEKILL | JOBCTL_LISTENING);

Oleg.
On Mon, Apr 25, 2022 at 04:35:37PM +0200, Oleg Nesterov wrote:
> On 04/21, Peter Zijlstra wrote:
> >
> > +static void clear_traced_quiesce(void)
> > +{
> > +	spin_lock_irq(&current->sighand->siglock);
> > +	WARN_ON_ONCE(!(current->jobctl & JOBCTL_TRACED_QUIESCE));
>
> This WARN_ON_ONCE() doesn't look right, the task can be killed right
> after ptrace_stop() sets JOBCTL_TRACED | JOBCTL_TRACED_QUIESCE and
> drops siglock.

OK, will look at that.

> > @@ -2290,14 +2303,26 @@ static int ptrace_stop(int exit_code, in
> > 		/*
> > 		 * Don't want to allow preemption here, because
> > 		 * sys_ptrace() needs this task to be inactive.
> > -		 *
> > -		 * XXX: implement read_unlock_no_resched().
> > 		 */
> > 		preempt_disable();
> > 		read_unlock(&tasklist_lock);
> > -		cgroup_enter_frozen();
> > +		cgroup_enter_frozen(); // XXX broken on PREEMPT_RT !!!
> > +
> > +		/*
> > +		 * JOBCTL_TRACE_QUIESCE bridges the gap between
> > +		 * set_current_state(TASK_TRACED) above and schedule() below.
> > +		 * There must not be any blocking (specifically anything that
> > +		 * touched ->saved_state on PREEMPT_RT) between here and
> > +		 * schedule().
> > +		 *
> > +		 * ptrace_check_attach() relies on this with its
> > +		 * wait_task_inactive() usage.
> > +		 */
> > +		clear_traced_quiesce();
>
> Well, I think it should be called earlier under tasklist_lock,
> before preempt_disable() above.
>
> We need tasklist_lock to protect ->parent, the debugger can be killed
> and go away right after read_unlock(&tasklist_lock).
>
> Still trying to convince myself everything is right with
> JOBCTL_STOPPED/TRACED ...

Can't do it earlier, since cgroup_enter_frozen() can do spinlock (e.g.
use ->saved_state).
Peter Zijlstra <peterz@infradead.org> writes:

> On Mon, Apr 25, 2022 at 04:35:37PM +0200, Oleg Nesterov wrote:
>> On 04/21, Peter Zijlstra wrote:
>> >
>> > +static void clear_traced_quiesce(void)
>> > +{
>> > +	spin_lock_irq(&current->sighand->siglock);
>> > +	WARN_ON_ONCE(!(current->jobctl & JOBCTL_TRACED_QUIESCE));
>>
>> This WARN_ON_ONCE() doesn't look right, the task can be killed right
>> after ptrace_stop() sets JOBCTL_TRACED | JOBCTL_TRACED_QUIESCE and
>> drops siglock.
>
> OK, will look at that.
>
>> > @@ -2290,14 +2303,26 @@ static int ptrace_stop(int exit_code, in
>> > 		/*
>> > 		 * Don't want to allow preemption here, because
>> > 		 * sys_ptrace() needs this task to be inactive.
>> > -		 *
>> > -		 * XXX: implement read_unlock_no_resched().
>> > 		 */
>> > 		preempt_disable();
>> > 		read_unlock(&tasklist_lock);
>> > -		cgroup_enter_frozen();
>> > +		cgroup_enter_frozen(); // XXX broken on PREEMPT_RT !!!
>> > +
>> > +		/*
>> > +		 * JOBCTL_TRACE_QUIESCE bridges the gap between
>> > +		 * set_current_state(TASK_TRACED) above and schedule() below.
>> > +		 * There must not be any blocking (specifically anything that
>> > +		 * touched ->saved_state on PREEMPT_RT) between here and
>> > +		 * schedule().
>> > +		 *
>> > +		 * ptrace_check_attach() relies on this with its
>> > +		 * wait_task_inactive() usage.
>> > +		 */
>> > +		clear_traced_quiesce();
>>
>> Well, I think it should be called earlier under tasklist_lock,
>> before preempt_disable() above.
>>
>> We need tasklist_lock to protect ->parent, debugger can be killed
>> and go away right after read_unlock(&tasklist_lock).
>>
>> Still trying to convince myself everything is right with
>> JOBCTL_STOPPED/TRACED ...
>
> Can't do it earlier, since cgroup_enter_frozen() can do spinlock (eg.
> use ->saved_state).

There are some other issues in this part of ptrace_stop().

I don't see JOBCTL_TRACED_QUIESCE being cleared "if (!current->ptrace)".
Currently in ptrace_check_attach a parameter of __TASK_TRACED is passed
so that wait_task_inactive can fail if the "!current->ptrace" branch of
ptrace_stop is taken and ptrace_stop does not stop.

With the TASK_FROZEN state it appears that the "!current->ptrace" branch
can continue and freeze somewhere else, and wait_task_inactive could
decide it was fine.

I have to run, but hopefully tomorrow I will post the patches that
remove the "!current->ptrace" case altogether and basically remove the
need for quiesce and for wait_task_inactive detecting which branch is
taken.

The spinlock in cgroup_enter_frozen remains an issue for PREEMPT_RT.
But the rest of the issues are cleared up by using siglock instead of
tasklist_lock. Plus the code is just easier to read and understand.

Eric
On 04/25, Eric W. Biederman wrote:
>
> I don't see JOBCTL_TRACED_QUIESCE being cleared "if (!current->ptrace)".

As Peter explained, in this case we can rely on __ptrace_unlink() which
should clear this flag.

Oleg.
Oleg Nesterov <oleg@redhat.com> writes:

> On 04/25, Eric W. Biederman wrote:
>>
>> I don't see JOBCTL_TRACED_QUIESCE being cleared "if (!current->ptrace)".
>
> As Peter explained, in this case we can rely on __ptrace_unlink() which
> should clear this flag.

I had missed that signal_wake_up_state was clearing
JOBCTL_TRACED_QUIESCE.

Relying on __ptrace_unlink assumes the __ptrace_unlink happens after
siglock is taken before calling ptrace_stop. Especially with the
ptrace_notify in signal_delivered that does not look guaranteed.

The __ptrace_unlink could also happen during arch_ptrace_stop.

Relying on siglock is sufficient because __ptrace_unlink holds siglock
over clearing task->ptrace. Which means that the simple fix for this is
to just test task->ptrace before we set JOBCTL_TRACED_QUIESCE.

Eric
On 04/26, Eric W. Biederman wrote:
>
> Relying on __ptrace_unlink assumes the __ptrace_unlink happens after
> siglock is taken before calling ptrace_stop. Especially with the
> ptrace_notify in signal_delivered that does not look guaranteed.
>
> The __ptrace_unlink could also happen during arch_ptrace_stop.
>
> Relying on siglock is sufficient because __ptrace_unlink holds siglock
> over clearing task->ptrace. Which means that the simple fix for this is
> to just test task->ptrace before we set JOBCTL_TRACED_QUIESCE.

Or simply clear _QUIESCE along with _TRACED/DELAY_WAKEKILL before
return?

Oleg.
Oleg Nesterov <oleg@redhat.com> writes:

> On 04/21, Peter Zijlstra wrote:
>>
>> @@ -2225,7 +2238,7 @@ static int ptrace_stop(int exit_code, in
>> 	 * schedule() will not sleep if there is a pending signal that
>> 	 * can awaken the task.
>> 	 */
>> -	current->jobctl |= JOBCTL_TRACED;
>> +	current->jobctl |= JOBCTL_TRACED | JOBCTL_TRACED_QUIESCE;
>> 	set_special_state(TASK_TRACED);
>
> OK, this looks wrong. I actually mean the previous patch which sets
> JOBCTL_TRACED.
>
> The problem is that the tracee can be already killed, so that
> fatal_signal_pending(current) is true. In this case we can't rely on
> signal_wake_up_state() which should clear JOBCTL_TRACED, or the
> callers of ptrace_signal_wake_up/etc which clear this flag by hand.
>
> In this case schedule() won't block and ptrace_stop() will leak
> JOBCTL_TRACED. Unless I missed something.
>
> We could check fatal_signal_pending() and damn! this is what I think
> ptrace_stop() should have done from the very beginning. But for now
> I'd suggest to simply clear this flag before return, along with
> DELAY_WAKEKILL and LISTENING.

Oh. That is an interesting case for JOBCTL_TRACED.

The scheduler refuses to stop if signal_pending_state(TASK_TRACED, p)
returns true. The ptrace_stop code used to handle this explicitly, and
in commit 7d613f9f72ec ("signal: Remove the bogus sigkill_pending in
ptrace_stop") I actually removed the test, as it was somewhat wrong and
redundant, and in slightly the wrong location.

But doing:

	/* Don't stop if the task is dying */
	if (unlikely(__fatal_signal_pending(current)))
		return exit_code;

should work.

>> 	current->jobctl &= ~JOBCTL_LISTENING;
>> +	current->jobctl &= ~JOBCTL_DELAY_WAKEKILL;
>
> 	current->jobctl &=
> 		~(~JOBCTL_TRACED | JOBCTL_DELAY_WAKEKILL | JOBCTL_LISTENING);

I presume you meant:

	current->jobctl &=
		~(JOBCTL_TRACED | JOBCTL_DELAY_WAKEKILL | JOBCTL_LISTENING);

I don't think we want to do that. For the case you are worried about it
is a valid fix.

In general this is the wrong approach, as we want the waker to clear
JOBCTL_TRACED. If the waker does not, it is possible that
ptrace_freeze_traced might attempt to freeze a process whose state is
not appropriate for attach, because the code is past the call to
schedule(). In fact I think clearing JOBCTL_TRACED at the end of
ptrace_stop will allow ptrace_freeze_traced to come in while siglock is
dropped, expect the process to stop, and have the process not stop. Of
course with wait_task_inactive coming first that might not be a problem.

This is a minor problem with the patchset I just posted. I thought the
only reason wait_task_inactive could fail was if ptrace_stop() hit the
!current->ptrace case. Thinking about it, any SIGKILL coming in before
the tracee stops in schedule will trigger this, so it is not as safe as
I thought to not pass a state into wait_task_inactive.

It is time for me to shut down today. I will sleep on that and see what
I can see tomorrow.

Eric
On 04/21, Peter Zijlstra wrote:
>
> @@ -1329,8 +1337,7 @@ SYSCALL_DEFINE4(ptrace, long, request, l
> 		goto out_put_task_struct;
>
> 	ret = arch_ptrace(child, request, addr, data);
> -	if (ret || request != PTRACE_DETACH)
> -		ptrace_unfreeze_traced(child);
> +	ptrace_unfreeze_traced(child);

Forgot to mention... whatever we do this doesn't look right.

ptrace_unfreeze_traced() must not be called if the tracee was untraced;
another debugger can come after that. I agree, the current code looks
a bit confusing, perhaps it makes sense to re-write it:

	if (request == PTRACE_DETACH && ret == 0)
		; /* nothing to do, no longer traced by us */
	else
		ptrace_unfreeze_traced(child);

Oleg.
Oleg Nesterov <oleg@redhat.com> writes:

> On 04/21, Peter Zijlstra wrote:
>>
>> @@ -1329,8 +1337,7 @@ SYSCALL_DEFINE4(ptrace, long, request, l
>> 		goto out_put_task_struct;
>>
>> 	ret = arch_ptrace(child, request, addr, data);
>> -	if (ret || request != PTRACE_DETACH)
>> -		ptrace_unfreeze_traced(child);
>> +	ptrace_unfreeze_traced(child);
>
> Forgot to mention... whatever we do this doesn't look right.
>
> ptrace_unfreeze_traced() must not be called if the tracee was untraced;
> another debugger can come after that. I agree, the current code looks
> a bit confusing, perhaps it makes sense to re-write it:
>
> 	if (request == PTRACE_DETACH && ret == 0)
> 		; /* nothing to do, no longer traced by us */
> 	else
> 		ptrace_unfreeze_traced(child);

This was a bug in my original JOBCTL_DELAY_WAKEKILL patch and it was
just cut and pasted here. I thought it made sense when I was throwing
things together, but when I looked more closely I realized that it is
not safe.

Eric
On Tue, Apr 26, 2022 at 07:24:03PM -0500, Eric W. Biederman wrote:
> But doing:
>
> 	/* Don't stop if the task is dying */
> 	if (unlikely(__fatal_signal_pending(current)))
> 		return exit_code;
>
> Should work.

Something like so then...

---
Subject: signal,ptrace: Don't stop dying tasks
From: Peter Zijlstra <peterz@infradead.org>
Date: Thu Apr 28 22:17:56 CEST 2022

Oleg pointed out that the tracee can already be killed such that
fatal_signal_pending() is true. In that case signal_wake_up_state()
cannot be relied upon to be responsible for the wakeup -- something
we're going to want to rely on.

As such, explicitly handle this case.

Suggested-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/signal.c |    4 ++++
 1 file changed, 4 insertions(+)

--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2226,6 +2226,10 @@ static int ptrace_stop(int exit_code, in
 		spin_lock_irq(&current->sighand->siglock);
 	}
 
+	/* Don't stop if the task is dying. */
+	if (unlikely(__fatal_signal_pending(current)))
+		return exit_code;
+
 	/*
 	 * schedule() will not sleep if there is a pending signal that
 	 * can awaken the task.
On 04/28, Peter Zijlstra wrote:
>
> Oleg pointed out that the tracee can already be killed such that
> fatal_signal_pending() is true. In that case signal_wake_up_state()
> cannot be relied upon to be responsible for the wakeup -- something
> we're going to want to rely on.

Peter, I am all confused...

If this patch is against the current tree, we don't need it.

If it is on top of JOBCTL_TRACED/DELAY_WAKEKILL changes (yours or
Eric's), then it can't help - SIGKILL can come right after the tracee
drops siglock and calls schedule().

Perhaps I missed something, but let me repeat for the 3rd time: I'd
suggest to simply clear JOBCTL_TRACED along with LISTENING/DELAY_WAKEKILL
before return to close this race.

Oleg.

> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -2226,6 +2226,10 @@ static int ptrace_stop(int exit_code, in
> 		spin_lock_irq(&current->sighand->siglock);
> 	}
>
> +	/* Don't stop if the task is dying. */
> +	if (unlikely(__fatal_signal_pending(current)))
> +		return exit_code;
> +
> 	/*
> 	 * schedule() will not sleep if there is a pending signal that
> 	 * can awaken the task.
On Thu, Apr 28, 2022 at 10:59:57PM +0200, Oleg Nesterov wrote:
> On 04/28, Peter Zijlstra wrote:
> >
> > Oleg pointed out that the tracee can already be killed such that
> > fatal_signal_pending() is true. In that case signal_wake_up_state()
> > cannot be relied upon to be responsible for the wakeup -- something
> > we're going to want to rely on.
>
> Peter, I am all confused...
>
> If this patch is against the current tree, we don't need it.
>
> If it is on top of JOBCTL_TRACED/DELAY_WAKEKILL changes (yours or Eric's),
> then it can't help - SIGKILL can come right after the tracee drops siglock
> and calls schedule().

But by that time it will already have set TRACED and signal_wake_up()
will clear it, no?

> Perhaps I missed something, but let me repeat for the 3rd time: I'd suggest
> to simply clear JOBCTL_TRACED along with LISTENING/DELAY_WAKEKILL before
> return to close this race.

I think Eric convinced me there was a problem with that, but I'll go
over it all again in the morning, perhaps I'll reach a different
conclusion :-)
Peter, you know, it is very difficult for me to discuss the changes in
the 2 unfinished series and not lose the context ;) Plus I am already
sleeping. But I'll try to reply anyway.

On 04/29, Peter Zijlstra wrote:
>
> On Thu, Apr 28, 2022 at 10:59:57PM +0200, Oleg Nesterov wrote:
> > If it is on top of JOBCTL_TRACED/DELAY_WAKEKILL changes (yours or Eric's),
> > then it can't help - SIGKILL can come right after the tracee drops siglock
> > and calls schedule().
>
> But by that time it will already have set TRACED and signal_wake_up()
> will clear it, no?

No. JOBCTL_DELAY_WAKEKILL is already set, which means that
signal_wake_up() will remove TASK_WAKEKILL from the "state" passed to
signal_wake_up_state(), and this is fine and correct, it means that
ttwu() won't change ->__state.

But this also means that wake_up_state() will return false, and in this
case

	signal_wake_up_state:

	if (wake_up_state(t, state | TASK_INTERRUPTIBLE))
		t->jobctl &= ~(JOBCTL_STOPPED | JOBCTL_TRACED | JOBCTL_TRACED_QUIESCE);

won't clear these flags. And this is nice too.

But. fatal_signal_pending() is true! And once we change freeze_traced()
to not abuse p->__state, schedule() won't block because it will check
signal_pending_state(TASK_TRACED == TASK_WAKEKILL | __TASK_TRACED) and
__fatal_signal_pending() == T.

In this case ptrace_stop() will leak JOBCTL_TRACED, so we simply need
to clear it before return, along with LISTENING | DELAY_WAKEKILL.

> I'll go
> over it all again in the morning, perhaps I'll reach a different
> conclusion :-)

Same here ;)

Oleg.
--- a/include/linux/sched/jobctl.h
+++ b/include/linux/sched/jobctl.h
@@ -19,9 +19,11 @@ struct task_struct;
 #define JOBCTL_TRAPPING_BIT	21	/* switching to TRACED */
 #define JOBCTL_LISTENING_BIT	22	/* ptracer is listening for events */
 #define JOBCTL_TRAP_FREEZE_BIT	23	/* trap for cgroup freezer */
+#define JOBCTL_DELAY_WAKEKILL_BIT 24	/* delay killable wakeups */
 
-#define JOBCTL_STOPPED_BIT	24	/* do_signal_stop() */
-#define JOBCTL_TRACED_BIT	25	/* ptrace_stop() */
+#define JOBCTL_STOPPED_BIT	25	/* do_signal_stop() */
+#define JOBCTL_TRACED_BIT	26	/* ptrace_stop() */
+#define JOBCTL_TRACED_QUIESCE_BIT 27
 
 #define JOBCTL_STOP_DEQUEUED	(1UL << JOBCTL_STOP_DEQUEUED_BIT)
 #define JOBCTL_STOP_PENDING	(1UL << JOBCTL_STOP_PENDING_BIT)
@@ -31,9 +33,11 @@ struct task_struct;
 #define JOBCTL_TRAPPING		(1UL << JOBCTL_TRAPPING_BIT)
 #define JOBCTL_LISTENING	(1UL << JOBCTL_LISTENING_BIT)
 #define JOBCTL_TRAP_FREEZE	(1UL << JOBCTL_TRAP_FREEZE_BIT)
+#define JOBCTL_DELAY_WAKEKILL	(1UL << JOBCTL_DELAY_WAKEKILL_BIT)
 
 #define JOBCTL_STOPPED		(1UL << JOBCTL_STOPPED_BIT)
 #define JOBCTL_TRACED		(1UL << JOBCTL_TRACED_BIT)
+#define JOBCTL_TRACED_QUIESCE	(1UL << JOBCTL_TRACED_QUIESCE_BIT)
 
 #define JOBCTL_TRAP_MASK	(JOBCTL_TRAP_STOP | JOBCTL_TRAP_NOTIFY)
 #define JOBCTL_PENDING_MASK	(JOBCTL_STOP_PENDING | JOBCTL_TRAP_MASK)
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -193,41 +193,44 @@ static bool looks_like_a_spurious_pid(st
  */
 static bool ptrace_freeze_traced(struct task_struct *task)
 {
+	unsigned long flags;
 	bool ret = false;
 
 	/* Lockless, nobody but us can set this flag */
 	if (task->jobctl & JOBCTL_LISTENING)
 		return ret;
 
-	spin_lock_irq(&task->sighand->siglock);
-	if (task_is_traced(task) && !looks_like_a_spurious_pid(task) &&
+	if (!lock_task_sighand(task, &flags))
+		return ret;
+
+	if (task_is_traced(task) &&
+	    !looks_like_a_spurious_pid(task) &&
 	    !__fatal_signal_pending(task)) {
-		WRITE_ONCE(task->__state, __TASK_TRACED);
+		WARN_ON_ONCE(READ_ONCE(task->__state) != TASK_TRACED);
+		WARN_ON_ONCE(task->jobctl & JOBCTL_DELAY_WAKEKILL);
+		task->jobctl |= JOBCTL_DELAY_WAKEKILL;
 		ret = true;
 	}
-	spin_unlock_irq(&task->sighand->siglock);
+	unlock_task_sighand(task, &flags);
 
 	return ret;
 }
 
 static void ptrace_unfreeze_traced(struct task_struct *task)
 {
-	if (READ_ONCE(task->__state) != __TASK_TRACED)
+	if (!task_is_traced(task))
 		return;
 
 	WARN_ON(!task->ptrace || task->parent != current);
 
-	/*
-	 * PTRACE_LISTEN can allow ptrace_trap_notify to wake us up remotely.
-	 * Recheck state under the lock to close this race.
-	 */
 	spin_lock_irq(&task->sighand->siglock);
-	if (READ_ONCE(task->__state) == __TASK_TRACED) {
+	if (task_is_traced(task)) {
+//		WARN_ON_ONCE(!(task->jobctl & JOBCTL_DELAY_WAKEKILL));
+		task->jobctl &= ~JOBCTL_DELAY_WAKEKILL;
 		if (__fatal_signal_pending(task)) {
 			task->jobctl &= ~JOBCTL_TRACED;
-			wake_up_state(task, __TASK_TRACED);
-		} else
-			WRITE_ONCE(task->__state, TASK_TRACED);
+			wake_up_state(task, TASK_WAKEKILL);
+		}
 	}
 	spin_unlock_irq(&task->sighand->siglock);
 }
@@ -251,40 +254,45 @@ static void ptrace_unfreeze_traced(struc
  */
 static int ptrace_check_attach(struct task_struct *child, bool ignore_state)
 {
-	int ret = -ESRCH;
+	int traced;
 
 	/*
 	 * We take the read lock around doing both checks to close a
-	 * possible race where someone else was tracing our child and
-	 * detached between these two checks. After this locked check,
-	 * we are sure that this is our traced child and that can only
-	 * be changed by us so it's not changing right after this.
+	 * possible race where someone else attaches or detaches our
+	 * natural child.
 	 */
 	read_lock(&tasklist_lock);
-	if (child->ptrace && child->parent == current) {
-		WARN_ON(READ_ONCE(child->__state) == __TASK_TRACED);
-		/*
-		 * child->sighand can't be NULL, release_task()
-		 * does ptrace_unlink() before __exit_signal().
-		 */
-		if (ignore_state || ptrace_freeze_traced(child))
-			ret = 0;
-	}
+	traced = child->ptrace && child->parent == current;
 	read_unlock(&tasklist_lock);
+	if (!traced)
+		return -ESRCH;
 
-	if (!ret && !ignore_state) {
-		if (!wait_task_inactive(child, __TASK_TRACED)) {
-			/*
-			 * This can only happen if may_ptrace_stop() fails and
-			 * ptrace_stop() changes ->state back to TASK_RUNNING,
-			 * so we should not worry about leaking __TASK_TRACED.
-			 */
-			WARN_ON(READ_ONCE(child->__state) == __TASK_TRACED);
-			ret = -ESRCH;
-		}
+	if (ignore_state)
+		return 0;
+
+	if (!task_is_traced(child))
+		return -ESRCH;
+
+	WARN_ON_ONCE(READ_ONCE(child->jobctl) & JOBCTL_DELAY_WAKEKILL);
+
+	/* Wait for JOBCTL_TRACED_QUIESCE to go away, see ptrace_stop(). */
+	for (;;) {
+		if (fatal_signal_pending(current))
+			return -EINTR;
+
+		set_current_state(TASK_KILLABLE);
+		if (!(READ_ONCE(child->jobctl) & JOBCTL_TRACED_QUIESCE))
+			break;
+
+		schedule();
 	}
+	__set_current_state(TASK_RUNNING);
 
-	return ret;
+	if (!wait_task_inactive(child, TASK_TRACED) ||
+	    !ptrace_freeze_traced(child))
+		return -ESRCH;
+
+	return 0;
 }
 
 static bool ptrace_has_cap(struct user_namespace *ns, unsigned int mode)
@@ -1329,8 +1337,7 @@ SYSCALL_DEFINE4(ptrace, long, request, l
 		goto out_put_task_struct;
 
 	ret = arch_ptrace(child, request, addr, data);
-	if (ret || request != PTRACE_DETACH)
-		ptrace_unfreeze_traced(child);
+	ptrace_unfreeze_traced(child);
 
 out_put_task_struct:
 	put_task_struct(child);
@@ -1472,8 +1479,7 @@ COMPAT_SYSCALL_DEFINE4(ptrace, compat_lo
 				  request == PTRACE_INTERRUPT);
 	if (!ret) {
 		ret = compat_arch_ptrace(child, request, addr, data);
-		if (ret || request != PTRACE_DETACH)
-			ptrace_unfreeze_traced(child);
+		ptrace_unfreeze_traced(child);
 	}
 
 out_put_task_struct:
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6310,10 +6310,7 @@ static void __sched notrace __schedule(u
 
 	/*
 	 * We must load prev->state once (task_struct::state is volatile), such
-	 * that:
-	 *
-	 * - we form a control dependency vs deactivate_task() below.
-	 * - ptrace_{,un}freeze_traced() can change ->state underneath us.
+	 * that we form a control dependency vs deactivate_task() below.
 	 */
 	prev_state = READ_ONCE(prev->__state);
 	if (!(sched_mode & SM_MASK_PREEMPT) && prev_state) {
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -764,6 +764,10 @@ void signal_wake_up_state(struct task_st
 {
 	lockdep_assert_held(&t->sighand->siglock);
 
+	/* Suppress wakekill? */
+	if (t->jobctl & JOBCTL_DELAY_WAKEKILL)
+		state &= ~TASK_WAKEKILL;
+
 	set_tsk_thread_flag(t, TIF_SIGPENDING);
 
 	/*
@@ -774,7 +778,7 @@ void signal_wake_up_state(struct task_st
 	 * handle its death signal.
 	 */
 	if (wake_up_state(t, state | TASK_INTERRUPTIBLE))
-		t->jobctl &= ~(JOBCTL_STOPPED | JOBCTL_TRACED);
+		t->jobctl &= ~(JOBCTL_STOPPED | JOBCTL_TRACED | JOBCTL_TRACED_QUIESCE);
 	else
 		kick_process(t);
 }
@@ -2187,6 +2191,15 @@ static void do_notify_parent_cldstop(str
 	spin_unlock_irqrestore(&sighand->siglock, flags);
 }
 
+static void clear_traced_quiesce(void)
+{
+	spin_lock_irq(&current->sighand->siglock);
+	WARN_ON_ONCE(!(current->jobctl & JOBCTL_TRACED_QUIESCE));
+	current->jobctl &= ~JOBCTL_TRACED_QUIESCE;
+	wake_up_state(current->parent, TASK_KILLABLE);
+	spin_unlock_irq(&current->sighand->siglock);
+}
+
 /*
  * This must be called with current->sighand->siglock held.
  *
@@ -2225,7 +2238,7 @@ static int ptrace_stop(int exit_code, in
 	 * schedule() will not sleep if there is a pending signal that
 	 * can awaken the task.
 	 */
-	current->jobctl |= JOBCTL_TRACED;
+	current->jobctl |= JOBCTL_TRACED | JOBCTL_TRACED_QUIESCE;
 	set_special_state(TASK_TRACED);
 
 	/*
@@ -2290,14 +2303,26 @@ static int ptrace_stop(int exit_code, in
 		/*
 		 * Don't want to allow preemption here, because
 		 * sys_ptrace() needs this task to be inactive.
-		 *
-		 * XXX: implement read_unlock_no_resched().
 		 */
 		preempt_disable();
 		read_unlock(&tasklist_lock);
-		cgroup_enter_frozen();
+		cgroup_enter_frozen(); // XXX broken on PREEMPT_RT !!!
+
+		/*
+		 * JOBCTL_TRACE_QUIESCE bridges the gap between
+		 * set_current_state(TASK_TRACED) above and schedule() below.
+		 * There must not be any blocking (specifically anything that
+		 * touched ->saved_state on PREEMPT_RT) between here and
+		 * schedule().
+		 *
+		 * ptrace_check_attach() relies on this with its
+		 * wait_task_inactive() usage.
+		 */
+		clear_traced_quiesce();
+
 		preempt_enable_no_resched();
 		freezable_schedule();
+		cgroup_leave_frozen(true);
 	} else {
 		/*
@@ -2335,6 +2360,7 @@ static int ptrace_stop(int exit_code, in
 
 	/* LISTENING can be set only during STOP traps, clear it */
 	current->jobctl &= ~JOBCTL_LISTENING;
+	current->jobctl &= ~JOBCTL_DELAY_WAKEKILL;
 
 	/*
 	 * Queued signals ignored us while we were stopped for tracing.
Rework ptrace_check_attach() / ptrace_unfreeze_traced() to not rely on
task->__state as much.

Due to how PREEMPT_RT is changing the rules vs task->__state with the
introduction of task->saved_state while TASK_RTLOCK_WAIT (the whole
blocking spinlock thing), the way ptrace freeze tries to do things no
longer works.

Specifically there are two problems:

 - due to ->saved_state, the ->__state modification removing
   TASK_WAKEKILL no longer works reliably.

 - due to ->saved_state, wait_task_inactive() also no longer works
   reliably.

The first problem is solved by a suggestion from Eric that instead of
changing __state, TASK_WAKEKILL be delayed.

The second problem is solved by a suggestion from Oleg; add
JOBCTL_TRACED_QUIESCE to cover the chunk of code between
set_current_state(TASK_TRACED) and schedule(), such that
ptrace_check_attach() can first wait for JOBCTL_TRACED_QUIESCE to get
cleared, and then use wait_task_inactive().

Suggested-by: Oleg Nesterov <oleg@redhat.com>
Suggested-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/sched/jobctl.h |    8 ++-
 kernel/ptrace.c              |   90 ++++++++++++++++++++-----------------------
 kernel/sched/core.c          |    5 --
 kernel/signal.c              |   36 ++++++++++++++---
 4 files changed, 86 insertions(+), 53 deletions(-)