diff mbox series

[7/7] proc: Ensure we see the exit of each process tid exactly once

Message ID 87r1yp7zhc.fsf_-_@x220.int.ebiederm.org (mailing list archive)
State New, archived
Headers show
Series [1/7] proc: Rename in proc_inode rename sysctl_inodes sibling_inodes | expand

Commit Message

Eric W. Biederman Feb. 20, 2020, 8:52 p.m. UTC
When the thread group leader changes during exec and the old leaders
thread is reaped proc_flush_pid will flush the dentries for the entire
process because the leader still has it's original pid.

Fix this by exchanging the pids in an rcu safe manner,
and wrapping the code to do that up in a helper exchange_tids.

When I removed switch_exec_pids and introduced this behavior
in d73d65293e3e ("[PATCH] pidhash: kill switch_exec_pids") there
really was nothing that cared as flushing happened with
the cached dentry and de_thread flushed both of them on exec.

This lack of fully exchanging pids became a problem a few months later
when I introduced 48e6484d4902 ("[PATCH] proc: Rewrite the proc dentry
flush on exit optimization").  Which overlooked the de_thread case
was no longer swapping pids, and I was looking up proc dentries
by task->pid.

The current behavior isn't properly a bug as everything in proc will
continue to work correctly just a little bit less efficiently.  Fix
this just so there are no little surprise corner cases waiting to bite
people.

Fixes: 48e6484d4902 ("[PATCH] proc: Rewrite the proc dentry flush on exit optimization").
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 fs/exec.c           |  5 +----
 include/linux/pid.h |  1 +
 kernel/pid.c        | 16 ++++++++++++++++
 3 files changed, 18 insertions(+), 4 deletions(-)

Comments

Oleg Nesterov Feb. 21, 2020, 4:50 p.m. UTC | #1
On 02/20, Eric W. Biederman wrote:
>
> +void exchange_tids(struct task_struct *ntask, struct task_struct *otask)
> +{
> +	/* pid_links[PIDTYPE_PID].next is always NULL */
> +	struct pid *npid = READ_ONCE(ntask->thread_pid);
> +	struct pid *opid = READ_ONCE(otask->thread_pid);
> +
> +	rcu_assign_pointer(opid->tasks[PIDTYPE_PID].first, &ntask->pid_links[PIDTYPE_PID]);
> +	rcu_assign_pointer(npid->tasks[PIDTYPE_PID].first, &otask->pid_links[PIDTYPE_PID]);
> +	rcu_assign_pointer(ntask->thread_pid, opid);
> +	rcu_assign_pointer(otask->thread_pid, npid);

this breaks has_group_leader_pid()...

proc_pid_readdir() can miss a process doing mt-exec but this looks fixable,
just we need to update ntask->thread_pid before updating ->first.

The more problematic case is __exit_signal() which does
		
		if (unlikely(has_group_leader_pid(tsk)))
			posix_cpu_timers_exit_group(tsk);

Oleg.
Eric W. Biederman Feb. 22, 2020, 3:46 p.m. UTC | #2
Oleg Nesterov <oleg@redhat.com> writes:

> On 02/20, Eric W. Biederman wrote:
>>
>> +void exchange_tids(struct task_struct *ntask, struct task_struct *otask)
>> +{
>> +	/* pid_links[PIDTYPE_PID].next is always NULL */
>> +	struct pid *npid = READ_ONCE(ntask->thread_pid);
>> +	struct pid *opid = READ_ONCE(otask->thread_pid);
>> +
>> +	rcu_assign_pointer(opid->tasks[PIDTYPE_PID].first, &ntask->pid_links[PIDTYPE_PID]);
>> +	rcu_assign_pointer(npid->tasks[PIDTYPE_PID].first, &otask->pid_links[PIDTYPE_PID]);
>> +	rcu_assign_pointer(ntask->thread_pid, opid);
>> +	rcu_assign_pointer(otask->thread_pid, npid);
>
> this breaks has_group_leader_pid()...
>
> proc_pid_readdir() can miss a process doing mt-exec but this looks fixable,
> just we need to update ntask->thread_pid before updating ->first.
>
> The more problematic case is __exit_signal() which does
> 		
> 		if (unlikely(has_group_leader_pid(tsk)))
> 			posix_cpu_timers_exit_group(tsk);

Along with the comment:
		/*
		 * This can only happen if the caller is de_thread().
		 * FIXME: this is the temporary hack, we should teach
		 * posix-cpu-timers to handle this case correctly.
		 */
So I suspect this is fixable and the above fix might be part of that.

Hmm looking at your commit:

commit e0a70217107e6f9844628120412cb27bb4cea194
Author: Oleg Nesterov <oleg@redhat.com>
Date:   Fri Nov 5 16:53:42 2010 +0100

    posix-cpu-timers: workaround to suppress the problems with mt exec
    
    posix-cpu-timers.c correctly assumes that the dying process does
    posix_cpu_timers_exit_group() and removes all !CPUCLOCK_PERTHREAD
    timers from signal->cpu_timers list.
    
    But, it also assumes that timer->it.cpu.task is always the group
    leader, and thus the dead ->task means the dead thread group.
    
    This is obviously not true after de_thread() changes the leader.
    After that almost every posix_cpu_timer_ method has problems.
    
    It is not simple to fix this bug correctly. First of all, I think
    that timer->it.cpu should use struct pid instead of task_struct.
    Also, the locking should be reworked completely. In particular,
    tasklist_lock should not be used at all. This all needs a lot of
    nontrivial and hard-to-test changes.
    
    Change __exit_signal() to do posix_cpu_timers_exit_group() when
    the old leader dies during exec. This is not the fix, just the
    temporary hack to hide the problem for 2.6.37 and stable. IOW,
    this is obviously wrong but this is what we currently have anyway:
    cpu timers do not work after mt exec.
    
    In theory this change adds another race. The exiting leader can
    detach the timers which were attached to the new leader. However,
    the window between de_thread() and release_task() is small, we
    can pretend that sys_timer_create() was called before de_thread().
    
    Signed-off-by: Oleg Nesterov <oleg@redhat.com>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

It looks like the data structures need fixing.  Possibly to use struct
pid.  Possibly to move the group data to signal struct.

I think I played with some of that awhile ago.

I am going to move this change to another patchset.  So I don't wind up
playing shift the bug around.  I thought I would need this to get the
other code working but it turns out we remain bug compatible without
this.

Hopefully I can get something out in the next week or so that addresses
the issues you have pointed out.

Eric
diff mbox series

Patch

diff --git a/fs/exec.c b/fs/exec.c
index db17be51b112..3f0bc293442e 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1148,11 +1148,8 @@  static int de_thread(struct task_struct *tsk)
 
 		/* Become a process group leader with the old leader's pid.
 		 * The old leader becomes a thread of the this thread group.
-		 * Note: The old leader also uses this pid until release_task
-		 *       is called.  Odd but simple and correct.
 		 */
-		tsk->pid = leader->pid;
-		change_pid(tsk, PIDTYPE_PID, task_pid(leader));
+		exchange_tids(tsk, leader);
 		transfer_pid(leader, tsk, PIDTYPE_TGID);
 		transfer_pid(leader, tsk, PIDTYPE_PGID);
 		transfer_pid(leader, tsk, PIDTYPE_SID);
diff --git a/include/linux/pid.h b/include/linux/pid.h
index 01a0d4e28506..0f40b5f1c32c 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -101,6 +101,7 @@  extern void attach_pid(struct task_struct *task, enum pid_type);
 extern void detach_pid(struct task_struct *task, enum pid_type);
 extern void change_pid(struct task_struct *task, enum pid_type,
 			struct pid *pid);
+extern void exchange_tids(struct task_struct *task, struct task_struct *old);
 extern void transfer_pid(struct task_struct *old, struct task_struct *new,
 			 enum pid_type);
 
diff --git a/kernel/pid.c b/kernel/pid.c
index 0f4ecb57214c..0085b15478fb 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -359,6 +359,22 @@  void change_pid(struct task_struct *task, enum pid_type type,
 	attach_pid(task, type);
 }
 
+void exchange_tids(struct task_struct *ntask, struct task_struct *otask)
+{
+	/* pid_links[PIDTYPE_PID].next is always NULL */
+	struct pid *npid = READ_ONCE(ntask->thread_pid);
+	struct pid *opid = READ_ONCE(otask->thread_pid);
+
+	rcu_assign_pointer(opid->tasks[PIDTYPE_PID].first, &ntask->pid_links[PIDTYPE_PID]);
+	rcu_assign_pointer(npid->tasks[PIDTYPE_PID].first, &otask->pid_links[PIDTYPE_PID]);
+	rcu_assign_pointer(ntask->thread_pid, opid);
+	rcu_assign_pointer(otask->thread_pid, npid);
+	WRITE_ONCE(ntask->pid_links[PIDTYPE_PID].pprev, &opid->tasks[PIDTYPE_PID].first);
+	WRITE_ONCE(otask->pid_links[PIDTYPE_PID].pprev, &npid->tasks[PIDTYPE_PID].first);
+	WRITE_ONCE(ntask->pid, pid_nr(opid));
+	WRITE_ONCE(otask->pid, pid_nr(npid));
+}
+
 /* transfer_pid is an optimization of attach_pid(new), detach_pid(old) */
 void transfer_pid(struct task_struct *old, struct task_struct *new,
 			   enum pid_type type)