diff mbox series

[v2,2/2] proc: Ensure we see the exit of each process tid exactly

Message ID 87wo66vvnm.fsf_-_@x220.int.ebiederm.org (mailing list archive)
State New, archived
Headers show
Series [v2,1/2] proc: Use PIDTYPE_TGID in next_tgid | expand

Commit Message

Eric W. Biederman April 23, 2020, 7:39 p.m. UTC
When the thread group leader changes during exec and the old leaders
thread is reaped proc_flush_pid will flush the dentries for the entire
process because the leader still has it's original pid.

Fix this by exchanging the pids in an rcu safe manner,
and wrapping the code to do that up in a helper exchange_tids.

When I removed switch_exec_pids and introduced this behavior
in d73d65293e3e ("[PATCH] pidhash: kill switch_exec_pids") there
really was nothing that cared as flushing happened with
the cached dentry and de_thread flushed both of them on exec.

This lack of fully exchanging pids became a problem a few months later
when I introduced 48e6484d4902 ("[PATCH] proc: Rewrite the proc dentry
flush on exit optimization").  Which overlooked the de_thread case
was no longer swapping pids, and I was looking up proc dentries
by task->pid.

The current behavior isn't properly a bug as everything in proc will
continue to work correctly just a little bit less efficiently.  Fix
this just so there are no little surprise corner cases waiting to bite
people.

-- Oleg points out this could be an issue in next_tgid in proc where
   has_group_leader_pid is called, and reording some of the assignments
   should fix that.

-- Oleg points out this will break the 10 year old hack in __exit_signal.c
>	/*
>	 * This can only happen if the caller is de_thread().
>	 * FIXME: this is the temporary hack, we should teach
>	 * posix-cpu-timers to handle this case correctly.
>	 */
>	if (unlikely(has_group_leader_pid(tsk)))
>		posix_cpu_timers_exit_group(tsk);

The code in next_tgid has been changed to use PIDTYPE_TGID,
and the posix cpu timers code has been fixed so it does not
need the 10 year old hack, so this should be safe to merge
now.

Fixes: 48e6484d4902 ("[PATCH] proc: Rewrite the proc dentry flush on exit optimization").
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 fs/exec.c           |  5 +----
 include/linux/pid.h |  1 +
 kernel/pid.c        | 16 ++++++++++++++++
 3 files changed, 18 insertions(+), 4 deletions(-)

Comments

Linus Torvalds April 23, 2020, 8:28 p.m. UTC | #1
On Thu, Apr 23, 2020 at 12:42 PM Eric W. Biederman
<ebiederm@xmission.com> wrote:
>
> +void exchange_tids(struct task_struct *ntask, struct task_struct *otask)
> +{
> +       /* pid_links[PIDTYPE_PID].next is always NULL */
> +       struct pid *npid = READ_ONCE(ntask->thread_pid);
> +       struct pid *opid = READ_ONCE(otask->thread_pid);
> +
> +       rcu_assign_pointer(opid->tasks[PIDTYPE_PID].first, &ntask->pid_links[PIDTYPE_PID]);
> +       rcu_assign_pointer(npid->tasks[PIDTYPE_PID].first, &otask->pid_links[PIDTYPE_PID]);
> +       rcu_assign_pointer(ntask->thread_pid, opid);
> +       rcu_assign_pointer(otask->thread_pid, npid);
> +       WRITE_ONCE(ntask->pid_links[PIDTYPE_PID].pprev, &opid->tasks[PIDTYPE_PID].first);
> +       WRITE_ONCE(otask->pid_links[PIDTYPE_PID].pprev, &npid->tasks[PIDTYPE_PID].first);
> +       WRITE_ONCE(ntask->pid, pid_nr(opid));
> +       WRITE_ONCE(otask->pid, pid_nr(npid));
> +}

This function is _very_ hard to read as written.

It really wants a helper function to do the swapping per hlist_head
and hlist_node, I think. And "opid/npid" is very hard to see, and the
naming doesn't make much sense (if it's an "exchange", then why is it
"old/new" - they're symmetric).

At least something like

        struct hlist_head *old_pid_hlist = opid->tasks + PIDTYPE_PID;
        struct hlist_head *new_pid_hlist = npid->tasks + PIDTYPE_PID;
        struct hlist_node *old_pid_node = otask->pid_links + PIDTYPE_PID;
        struct hlist_node *new_pid_node = ntask->pid_links + PIDTYPE_PID;

        struct hlist_node *old_first_node = old_pid_hlist->first;
        struct hlist_node *new_first_node = new_pid_hlist->first;

and then trying to group up the first/pprev/thread_pid/pid  accesses
so that you them together, and using a helper function that does the
whole switch, so that you'd have

        /* Move new node to old hlist, and update thread_pid/pid fields */
        insert_pid_pointers(old_pid_hlist, new_pid_node, new_first_node);
        rcu_assign_pointer(ntask->thread_pid, opid);
        WRITE_ONCE(ntask->pid, pid_nr(opid));

        /* Move old new to new hlist, and update thread_pid/pid fields */
        insert_pid_pointers(new_pid_hlist, old_pid_node, old_first_node);
        rcu_assign_pointer(otask->thread_pid, npid);
        WRITE_ONCE(otask->pid, pid_nr(npid));

or something roughly like that.

(And the above still uses "old/new", which as mentioned sounds wrong
to me. Maybe it should just be "a_xyz" and "b_xyz"? Also note that I
did this in my MUA, so I could have gotten the names and types wrong
etc).

I think that would make it look at least _slightly_ less like random
line noise and easier to follow.

But maybe even a rcu_hlist_swap() helper? We have one for regular
lists. Do we really have to do it all written out, not do it with a
"remove and reinsert" model?

                Linus
Eric W. Biederman April 24, 2020, 3:33 a.m. UTC | #2
Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Thu, Apr 23, 2020 at 12:42 PM Eric W. Biederman
> <ebiederm@xmission.com> wrote:
>>
>> +void exchange_tids(struct task_struct *ntask, struct task_struct *otask)
>> +{
>> +       /* pid_links[PIDTYPE_PID].next is always NULL */
>> +       struct pid *npid = READ_ONCE(ntask->thread_pid);
>> +       struct pid *opid = READ_ONCE(otask->thread_pid);
>> +
>> +       rcu_assign_pointer(opid->tasks[PIDTYPE_PID].first, &ntask->pid_links[PIDTYPE_PID]);
>> +       rcu_assign_pointer(npid->tasks[PIDTYPE_PID].first, &otask->pid_links[PIDTYPE_PID]);
>> +       rcu_assign_pointer(ntask->thread_pid, opid);
>> +       rcu_assign_pointer(otask->thread_pid, npid);
>> +       WRITE_ONCE(ntask->pid_links[PIDTYPE_PID].pprev, &opid->tasks[PIDTYPE_PID].first);
>> +       WRITE_ONCE(otask->pid_links[PIDTYPE_PID].pprev, &npid->tasks[PIDTYPE_PID].first);
>> +       WRITE_ONCE(ntask->pid, pid_nr(opid));
>> +       WRITE_ONCE(otask->pid, pid_nr(npid));
>> +}
>
> This function is _very_ hard to read as written.
>
> It really wants a helper function to do the swapping per hlist_head
> and hlist_node, I think. And "opid/npid" is very hard to see, and the
> naming doesn't make much sense (if it's an "exchange", then why is it
> "old/new" - they're symmetric).
>
> At least something like
>
>         struct hlist_head *old_pid_hlist = opid->tasks + PIDTYPE_PID;
>         struct hlist_head *new_pid_hlist = npid->tasks + PIDTYPE_PID;
>         struct hlist_node *old_pid_node = otask->pid_links + PIDTYPE_PID;
>         struct hlist_node *new_pid_node = ntask->pid_links + PIDTYPE_PID;
>
>         struct hlist_node *old_first_node = old_pid_hlist->first;
>         struct hlist_node *new_first_node = new_pid_hlist->first;
>
> and then trying to group up the first/pprev/thread_pid/pid  accesses
> so that you them together, and using a helper function that does the
> whole switch, so that you'd have
>
>         /* Move new node to old hlist, and update thread_pid/pid fields */
>         insert_pid_pointers(old_pid_hlist, new_pid_node, new_first_node);
>         rcu_assign_pointer(ntask->thread_pid, opid);
>         WRITE_ONCE(ntask->pid, pid_nr(opid));
>
>         /* Move old new to new hlist, and update thread_pid/pid fields */
>         insert_pid_pointers(new_pid_hlist, old_pid_node, old_first_node);
>         rcu_assign_pointer(otask->thread_pid, npid);
>         WRITE_ONCE(otask->pid, pid_nr(npid));
>
> or something roughly like that.
>
> (And the above still uses "old/new", which as mentioned sounds wrong
> to me. Maybe it should just be "a_xyz" and "b_xyz"? Also note that I
> did this in my MUA, so I could have gotten the names and types wrong
> etc).
>
> I think that would make it look at least _slightly_ less like random
> line noise and easier to follow.
>
> But maybe even a rcu_hlist_swap() helper? We have one for regular
> lists. Do we really have to do it all written out, not do it with a
> "remove and reinsert" model?

At one point my brain I had forgetten that xchg can not take two memory
arguments and had hoped to be able to provide stronger guarnatees than I
can.  Which is where I think the structure of exchange_pids came from.

I do agree the clearer we can write things, the easier it is for
someone else to come along and follow.

We can not use a remove and reinser model because that does break rcu
accesses, and complicates everything else.  With a swap model we have
the struct pids pointer at either of the tasks that are swapped but
never at nothing.  With a remove/reinsert model we have to deal the
addittional possibility of the pids not pointing at a thread at all
which can result in things like signals not being delivered at all.

I played with it a bit and the best I have been able to come up is:

	void hlist_swap_before_rcu(struct hlist_node *left, struct hlist_node *right)
	{
		struct hlist_node **lpprev = left->pprev;
		struct hlist_node **rpprev = right->pprev;
	
		rcu_assign_pointer(*lpprev, right);
		rcu_assign_pointer(*rpprev, left);
		WRITE_ONCE(left->pprev,  rpprev);
		WRITE_ONCE(right->pprev, lpprev);
	}
	
	void exchange_tids(struct task_struct *left, struct task_struct *right)
	{
		struct hlist_node *lnode = &left->pid_links[PIDTYPE_PID];
		struct hlist_node *rnode = &right->pid_links[PIDTYPE_PID];
		struct pid *lpid, *rpid;
	
		/* Replace the single entry tid lists with each other */
		hlist_swap_before_rcu(lnode, rnode);

		/* Swap thread_pid */
		rpid = left->thread_pid;
		lpid = right->thread_pid;
		rcu_assign_pointer(left->thread_pid, lpid);
		rcu_assign_pointer(right->thread_pid, rpid);

                /* Swap the cached pid value */
		WRITE_ONCE(left->pid, pid_nr(lpid));
		WRITE_ONCE(right->pid, pid_nr(rpid));
	}

hlists because they are not doubly linked can legitimately swap their
beginnings or their tails.  Something that regular lists can not,
and I think that is exactly the general purpose semantic I want.

Does that look a little more readable?

Eric
Oleg Nesterov April 24, 2020, 5:39 p.m. UTC | #3
On 04/23, Eric W. Biederman wrote:
>
> When the thread group leader changes during exec and the old leaders
> thread is reaped proc_flush_pid

This is off-topic, but let me mention this before I forget...

Note that proc_flush_pid() does nothing if CONFIG_PROC_FS=n, this mean
that in this case release_task() leaks thread_pid.

> +void exchange_tids(struct task_struct *ntask, struct task_struct *otask)
> +{
> +	/* pid_links[PIDTYPE_PID].next is always NULL */
> +	struct pid *npid = READ_ONCE(ntask->thread_pid);
> +	struct pid *opid = READ_ONCE(otask->thread_pid);
> +
> +	rcu_assign_pointer(opid->tasks[PIDTYPE_PID].first, &ntask->pid_links[PIDTYPE_PID]);
> +	rcu_assign_pointer(npid->tasks[PIDTYPE_PID].first, &otask->pid_links[PIDTYPE_PID]);
> +	rcu_assign_pointer(ntask->thread_pid, opid);
> +	rcu_assign_pointer(otask->thread_pid, npid);
> +	WRITE_ONCE(ntask->pid_links[PIDTYPE_PID].pprev, &opid->tasks[PIDTYPE_PID].first);
> +	WRITE_ONCE(otask->pid_links[PIDTYPE_PID].pprev, &npid->tasks[PIDTYPE_PID].first);
> +	WRITE_ONCE(ntask->pid, pid_nr(opid));
> +	WRITE_ONCE(otask->pid, pid_nr(npid));
> +}

Oh, at first glance this breaks posix-cpu-timers.c:lookup_task(), the last
user of has_group_leader_pid().

I think that we should change lookup_task() to return "struct *pid", this
should simplify the code... Note that none of its callers needs task_struct.

And, instead of thread_group_leader/has_group_leader_pid checks we should
use pid_has_task(TGID).

After that, this patch should kill has_group_leader_pid().

What do you think?

Oleg.
Linus Torvalds April 24, 2020, 6:02 p.m. UTC | #4
On Thu, Apr 23, 2020 at 8:36 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> At one point my brain I had forgetten that xchg can not take two memory
> arguments and had hoped to be able to provide stronger guarnatees than I
> can.  Which is where I think the structure of exchange_pids came from.

Note that even if we were to have a "exchange two memory locations
atomically" instruction (and we don't - even a "double cmpxchg" is
actually just a double-_sized_ one, not a two different locations
one), I'm not convinced it makes sense.

There's no way to _walk_ two lists atomically. Any user will only ever
walk one or the other, so it's not sensible to try to make the two
list updates be atomic.

And if a user for some reason walks both, the walking itself will
obviously then be racy - it does one or the other first, and can see
either the old state, or the new state - or see _neither_ (ie if you
walk it twice, you might see neither task, or you might see both, just
depending on order or walk).

> I do agree the clearer we can write things, the easier it is for
> someone else to come along and follow.

Your alternate write of the function seems a bit more readable to me,
even if the main effect might be just that it was split up a bit and
added a few comments and whitespace.

So I'm more happier with that one. That said:

> We can not use a remove and reinser model because that does break rcu
> accesses, and complicates everything else.  With a swap model we have
> the struct pids pointer at either of the tasks that are swapped but
> never at nothing.

I'm not suggesting removing the pid entirely - like making task->pid
be NULL. I'm literally suggesting just doing the RCU list operations
as "remove and re-insert".

And that shouldn't break anything, for the same reason that an atomic
exchange doesn't make sense: you can only ever walk one of the lists
at a time. And regardless of how you walk it, you might not see the
new state (or the old state) reliably.

Put another way:

>         void hlist_swap_before_rcu(struct hlist_node *left, struct hlist_node *right)
>         {
>                 struct hlist_node **lpprev = left->pprev;
>                 struct hlist_node **rpprev = right->pprev;
>
>                 rcu_assign_pointer(*lpprev, right);
>                 rcu_assign_pointer(*rpprev, left);

These are the only two assignments that matter for anything that walks
the list (the pprev ones are for things that change the list, and they
have to have exclusions in place).

And those two writes cannot be atomic anyway, so you fundamentally
will always be in the situation that a walker can miss one of the
tasks.

Which is why I think it would be ok to just do the RCU list swap as a
"remove left, remove right, add left, add right" operation. It doesn't
seem fundamentally different to a walker than the "switch left/right"
operation, and it seems much simpler.

Is there something I'm missing?

But I'm *not* suggesting that we change these simple parts to be
"remove thread_pid or pid pointer, and then insert a new one":

>                 /* Swap thread_pid */
>                 rpid = left->thread_pid;
>                 lpid = right->thread_pid;
>                 rcu_assign_pointer(left->thread_pid, lpid);
>                 rcu_assign_pointer(right->thread_pid, rpid);
>
>                 /* Swap the cached pid value */
>                 WRITE_ONCE(left->pid, pid_nr(lpid));
>                 WRITE_ONCE(right->pid, pid_nr(rpid));
>         }

because I agree that for things that don't _walk_ the list, but just
look up "thread_pid" vs "pid" atomically but asynchronously, we
obviously need to get one or the other, not some kind of "empty"
state.

> Does that look a little more readable?

Regardless, I find your new version at least a lot more readable, so
I'm ok with it.

It looks like Oleg found an independent issue, though.

                  Linus
Eric W. Biederman April 24, 2020, 6:10 p.m. UTC | #5
Oleg Nesterov <oleg@redhat.com> writes:

> On 04/23, Eric W. Biederman wrote:
>>
>> When the thread group leader changes during exec and the old leaders
>> thread is reaped proc_flush_pid
>
> This is off-topic, but let me mention this before I forget...
>
> Note that proc_flush_pid() does nothing if CONFIG_PROC_FS=n, this mean
> that in this case release_task() leaks thread_pid.

Good catch.  Wow.  I seem to be introducing almost as many bugs as I am
fixing right now.  Ouch.

>> +void exchange_tids(struct task_struct *ntask, struct task_struct *otask)
>> +{
>> +	/* pid_links[PIDTYPE_PID].next is always NULL */
>> +	struct pid *npid = READ_ONCE(ntask->thread_pid);
>> +	struct pid *opid = READ_ONCE(otask->thread_pid);
>> +
>> +	rcu_assign_pointer(opid->tasks[PIDTYPE_PID].first, &ntask->pid_links[PIDTYPE_PID]);
>> +	rcu_assign_pointer(npid->tasks[PIDTYPE_PID].first, &otask->pid_links[PIDTYPE_PID]);
>> +	rcu_assign_pointer(ntask->thread_pid, opid);
>> +	rcu_assign_pointer(otask->thread_pid, npid);
>> +	WRITE_ONCE(ntask->pid_links[PIDTYPE_PID].pprev, &opid->tasks[PIDTYPE_PID].first);
>> +	WRITE_ONCE(otask->pid_links[PIDTYPE_PID].pprev, &npid->tasks[PIDTYPE_PID].first);
>> +	WRITE_ONCE(ntask->pid, pid_nr(opid));
>> +	WRITE_ONCE(otask->pid, pid_nr(npid));
>> +}
>
> Oh, at first glance this breaks posix-cpu-timers.c:lookup_task(), the last
> user of has_group_leader_pid().
>
> I think that we should change lookup_task() to return "struct *pid", this
> should simplify the code... Note that none of its callers needs task_struct.
>
> And, instead of thread_group_leader/has_group_leader_pid checks we should
> use pid_has_task(TGID).

Somehow I thought we could get away without fiddling with that right
now, but on second glance I can see the races.

I played with this earlier and I agree returning a struct pid *
is desirable.  I will see if I can track down the patches I was
playing with as that definitely needs to get fixed first.

> After that, this patch should kill has_group_leader_pid().
>
> What do you think?

I agree completely.  has_group_leader_pid is the same as
thread_group_leader after this so should be removed.  Especially as it
won't have any users.

There are several other potential cleanups as well.  Such as not
using a hlist for PIDTYPE_PID.  Which would allow us to run the hlists
through struct signal_struct instead.  I think that would clean things
up but that touches so many things it may just be pointless code churn.

Just for mentioning I am thinking we should rename PIDTYPE_PID to
PIDTYPE_TID just to create a distance in peoples minds between
the kernel concepts and the user space concepts.

Eric
Linus Torvalds April 24, 2020, 6:46 p.m. UTC | #6
On Fri, Apr 24, 2020 at 11:02 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
>  [..] even a "double cmpxchg" is
> actually just a double-_sized_ one, not a two different locations
> one

Historical accuracy side note: the 68020 actually had a CAS2 that was
"two different locations".

Maybe somebody else did too.

            Linus
Eric W. Biederman April 24, 2020, 7:51 p.m. UTC | #7
Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Thu, Apr 23, 2020 at 8:36 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>>
>> At one point my brain I had forgetten that xchg can not take two memory
>> arguments and had hoped to be able to provide stronger guarnatees than I
>> can.  Which is where I think the structure of exchange_pids came from.
>
> Note that even if we were to have a "exchange two memory locations
> atomically" instruction (and we don't - even a "double cmpxchg" is
> actually just a double-_sized_ one, not a two different locations
> one), I'm not convinced it makes sense.
>
> There's no way to _walk_ two lists atomically. Any user will only ever
> walk one or the other, so it's not sensible to try to make the two
> list updates be atomic.
>
> And if a user for some reason walks both, the walking itself will
> obviously then be racy - it does one or the other first, and can see
> either the old state, or the new state - or see _neither_ (ie if you
> walk it twice, you might see neither task, or you might see both, just
> depending on order or walk).
>
>> I do agree the clearer we can write things, the easier it is for
>> someone else to come along and follow.
>
> Your alternate write of the function seems a bit more readable to me,
> even if the main effect might be just that it was split up a bit and
> added a few comments and whitespace.
>
> So I'm more happier with that one. That said:
>
>> We can not use a remove and reinser model because that does break rcu
>> accesses, and complicates everything else.  With a swap model we have
>> the struct pids pointer at either of the tasks that are swapped but
>> never at nothing.
>
> I'm not suggesting removing the pid entirely - like making task->pid
> be NULL. I'm literally suggesting just doing the RCU list operations
> as "remove and re-insert".
>
> And that shouldn't break anything, for the same reason that an atomic
> exchange doesn't make sense: you can only ever walk one of the lists
> at a time. And regardless of how you walk it, you might not see the
> new state (or the old state) reliably.
>
> Put another way:
>
>>         void hlist_swap_before_rcu(struct hlist_node *left, struct hlist_node *right)
>>         {
>>                 struct hlist_node **lpprev = left->pprev;
>>                 struct hlist_node **rpprev = right->pprev;
>>
>>                 rcu_assign_pointer(*lpprev, right);
>>                 rcu_assign_pointer(*rpprev, left);
>
> These are the only two assignments that matter for anything that walks
> the list (the pprev ones are for things that change the list, and they
> have to have exclusions in place).
>
> And those two writes cannot be atomic anyway, so you fundamentally
> will always be in the situation that a walker can miss one of the
> tasks.
>
> Which is why I think it would be ok to just do the RCU list swap as a
> "remove left, remove right, add left, add right" operation. It doesn't
> seem fundamentally different to a walker than the "switch left/right"
> operation, and it seems much simpler.
>
> Is there something I'm missing?


The problem with

remove
remove
add
add
is:

A lookup that hit between the remove and the add could return nothing.

The function kill_pid_info does everything it can to handle this case
today does:

int kill_pid_info(int sig, struct kernel_siginfo *info, struct pid *pid)
{
	int error = -ESRCH;
	struct task_struct *p;

	for (;;) {
		rcu_read_lock();
		p = pid_task(pid, PIDTYPE_PID);
		if (p)
			error = group_send_sig_info(sig, info, p, PIDTYPE_TGID);
		rcu_read_unlock();
		if (likely(!p || error != -ESRCH))
			return error;

		/*
		 * The task was unhashed in between, try again.  If it
		 * is dead, pid_task() will return NULL, if we race with
		 * de_thread() it will find the new leader.
		 */
	}
}

Now kill_pid_info is signalling the entire task and is just using
PIDTYPE_PID to find a thread in the task.

With the remove then add model there will be a point where pid_task
will return nothing, because ever so briefly the lists will be
empty.

However with an actually swap we will find a task and kill_pid_info
will work.  It pathloglical cases lock_task_sighand might have to loop
and we would need to find the new task that has the given pid.  But
kill_pid_info is guaranteed to work with swaps and will fail with
remove add.


> But I'm *not* suggesting that we change these simple parts to be
> "remove thread_pid or pid pointer, and then insert a new one":
>
>>                 /* Swap thread_pid */
>>                 rpid = left->thread_pid;
>>                 lpid = right->thread_pid;
>>                 rcu_assign_pointer(left->thread_pid, lpid);
>>                 rcu_assign_pointer(right->thread_pid, rpid);
>>
>>                 /* Swap the cached pid value */
>>                 WRITE_ONCE(left->pid, pid_nr(lpid));
>>                 WRITE_ONCE(right->pid, pid_nr(rpid));
>>         }
>
> because I agree that for things that don't _walk_ the list, but just
> look up "thread_pid" vs "pid" atomically but asynchronously, we
> obviously need to get one or the other, not some kind of "empty"
> state.

For PIDTYPE_PID and PIDTYPE_TGID these practically aren't lists but
pointers to the appropriate task.  Only for PIDTYPE_PGID and PIDTYPE_SID
do these become lists in practice.

That not-really-a-list status allows for signel delivery to indivdual
processes to happen in rcu context.  Which is where we would get into
trouble with add/remove.

Since signals are guaranteed to be delivered to the entire session
or the entire process group all of the list walking happens under
the tasklist_lock currently.  Which really keeps list walking from
being a concern.

>> Does that look a little more readable?
>
> Regardless, I find your new version at least a lot more readable, so
> I'm ok with it.

Good. Then I will finish cleaning it up and go with that version.

> It looks like Oleg found an independent issue, though.

Yes, and I will definitely work through those.

Eric
Linus Torvalds April 24, 2020, 8:10 p.m. UTC | #8
On Fri, Apr 24, 2020 at 12:54 PM Eric W. Biederman
<ebiederm@xmission.com> wrote:
>
> The problem with
>
> remove
> remove
> add
> add
> is:
>
> A lookup that hit between the remove and the add could return nothing.

Argh. Because that thing doesn't actually _search_ the list, it just
wants to pick any (first) entry. So it doesn't actually care what it
gets, just that it gets something.

Ok, I see.

> For PIDTYPE_PID and PIDTYPE_TGID these practically aren't lists but
> pointers to the appropriate task.  Only for PIDTYPE_PGID and PIDTYPE_SID
> do these become lists in practice.

Yeah, that's what confused me. Ok, please add a comment about the
"single-entry list" issue, and I'm happy.

            Linus
Oleg Nesterov April 26, 2020, 5:22 p.m. UTC | #9
Eric,

I am sick today and can't read the code, but I feel this patch is not
right ... please correct me.

So, iiuc when posix_cpu_timer_create() is called and CPUCLOCK_PERTHREAD
is false we roughly have

	task = pid_task(pid, PIDTYPE_TGID);			// lookup_task()

	/* WINDOW */

	timer->it.cpu.pid = = get_task_pid(task, PIDTYPE_TGID)	// posix_cpu_timer_create()

Now suppose that we race with mt-exec and this "task" is the old leader;
it can be release_task()'ed in the WINDOW above and then get_task_pid()
will return NULL.

That is why I suggested to change lookup_task() to return "struct pid*"
to eliminate the pid -> task -> pid transition.

Apart from the same_thread_group() check for the "thread" case we do not
need task_struct at all, lookup_task() can do

	if (thread) {
		p = pid_task(pid, PIDTYPE_PID);
		if (p && !same_thread_group(p, current))
			pid = NULL;
	} else {
		... gettime check ...

		if (!pid_has_task(pid, PIDTYPE_TGID))
			pid = NULL;
	}

	return pid;

No?

Oleg.
Linus Torvalds April 26, 2020, 5:40 p.m. UTC | #10
On Sun, Apr 26, 2020 at 7:14 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> To support this add hlist_swap_before_rcu.  An hlist primitive that
> will allow swaping the leading sections of two tasks.  For exchanging
> the task pids it will just be swapping the hlist_heads of two single
> entry lists.  But the functionality is more general.

So I have no problems with the series now - the code is much more
understandable. Partly because of the split-up, partly because of the
comments, and partly because you explained the special case and why it
was a valid thing to do...

However, I did start thinking about this case again.

I still don't think the "swap entry" macro is necessarily useful in
_general_ - any time it's an actual individual entry, that swap macro
doesn't really work.

So the only reason it works here is because you're actually swapping
the whole list.

But that, in turn, shouldn't be using that "first node" model at all,
it should use the hlist_head. That would have made it a lot more
obvious what is actually going on to me.

Now, the comment very much talks about the head case, but the code
still looks like it's swapping a non-head thing.

I guess the code technically _works_ with "swap two list ends", but
does that actually really make sense as an operation?

So I no longer hate how this patch looks, but I wonder if we should
just make the whole "this node is the *first* node" a bit more
explicit in both the caller and in the swapping code.

It could be as simple as replacing just the conceptual types and
names, so instead of some "pnode1" double-indirect node pointer, we'd
have

        struct hlist_head *left_head = container_of(left->pprev,
struct hlist_head, first);
        struct hlist_head *right_head = container_of(right->pprev,
struct hlist_head, first);

and then the code would do

        rcu_assign_pointer(right_head->first, left);
        rcu_assign_pointer(left_head->first, right);
        WRITE_ONCE(left->pprev,  &right_head->first);
        WRITE_ONCE(right->pprev,  &left_head->first);

which should generate the exact same code, but makes it clear that
what we're doing is switching the whole hlist when given the first
entries.

Doesn't that make what it actually does a lot more understandable? The
*pnode1/pnode2 games are somewhat opaque, but with that type and name
change and using "container_of()", the code now fairly naturally reads
as "oh, we're changing the first pointers in the list heads, and
making the nodes point back to them" .

Again - the current function _works_ with swapping two hlists in the
middle (not two entries - it swaps the whole list starting at that
entry!), so your current patch is in some ways "more generic". I'm
just suggesting that the generic case doesn't make much sense, and
that the "we know the first entries, swap the lists" actually is what
the real use is, and writing it as such makes the code easier to
understand.

But I'm not going to insist on this, so this is more an RFC. Maybe
people disagree, and/or have an actual use case for that "break two
hlists in the middle, swap the ends" that I find unlikely...

(NOTE: My "convert to hlist_head" code _works_ for that case too
because the code generation is the same! But it would be really really
confusing for that code to be used for anything but the first entry).

                       Linus
Thomas Gleixner April 27, 2020, 9:43 a.m. UTC | #11
ebiederm@xmission.com (Eric W. Biederman) writes:

> This allows the getref flag to be removed and the callers can
> than take a task reference if needed.

That changelog lacks any form of information why this should be
changed. I can see the point vs. patch 2, but pretty please put coherent
explanations into each patch.

> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> ---
>  kernel/time/posix-cpu-timers.c | 41 +++++++++++++++++-----------------
>  1 file changed, 21 insertions(+), 20 deletions(-)
>
> diff --git a/kernel/time/posix-cpu-timers.c b/kernel/time/posix-cpu-timers.c
> index 2fd3b3fa68bf..eba41c70f0f0 100644
> --- a/kernel/time/posix-cpu-timers.c
> +++ b/kernel/time/posix-cpu-timers.c
> @@ -86,36 +86,34 @@ static struct task_struct *lookup_task(const pid_t pid, bool thread,
>  }
>  
>  static struct task_struct *__get_task_for_clock(const clockid_t clock,
> -						bool getref, bool gettime)
> +						bool gettime)
>  {
>  	const bool thread = !!CPUCLOCK_PERTHREAD(clock);
>  	const pid_t pid = CPUCLOCK_PID(clock);
> -	struct task_struct *p;
>  
>  	if (CPUCLOCK_WHICH(clock) >= CPUCLOCK_MAX)
>  		return NULL;
>  
> -	rcu_read_lock();
> -	p = lookup_task(pid, thread, gettime);
> -	if (p && getref)
> -		get_task_struct(p);
> -	rcu_read_unlock();
> -	return p;
> +	return lookup_task(pid, thread, gettime);
>  }
>  
>  static inline struct task_struct *get_task_for_clock(const clockid_t clock)
>  {
> -	return __get_task_for_clock(clock, true, false);
> +	return __get_task_for_clock(clock, false);
>  }
>  
>  static inline struct task_struct *get_task_for_clock_get(const clockid_t clock)
>  {
> -	return __get_task_for_clock(clock, true, true);
> +	return __get_task_for_clock(clock, true);
>  }
>  
>  static inline int validate_clock_permissions(const clockid_t clock)
>  {
> -	return __get_task_for_clock(clock, false, false) ? 0 : -EINVAL;
> +	int ret;

New line between declarations and code please.

> +	rcu_read_lock();
> +	ret = __get_task_for_clock(clock, false) ? 0 : -EINVAL;
> +	rcu_read_unlock();
> +	return ret;
>  }

Thanks,

        tglx
Thomas Gleixner April 27, 2020, 10:32 a.m. UTC | #12
ebiederm@xmission.com (Eric W. Biederman) writes:
> Using pid_task(find_vpid(N), PIDTYPE_TGID) guarantees that if a task
> is found it is at that moment the thread group leader.  Which removes
> the need for the follow on test has_group_leader_pid.
>
> I have reorganized the rest of the code in lookup_task for clarity,
> and created a common return for most of the code.

Sorry, it's way harder to read than the very explicit exits which were
there before.

> The special case for clock_gettime with "pid == gettid" is not my
> favorite.  I strongly suspect it isn't used as gettid is such a pain,
> and passing 0 is much easier.  Still it is easier to keep this special
> case than to do the reasarch that will show it isn't used.

It might be not your favorite, but when I refactored the code I learned
the hard way that one of the test suites has assumptions that
clock_gettime(PROCESS) works from any task of a group and not just for
the group leader. Sure we could fix the test suite, but test code tends
to be copied ...

>  /*
>   * Functions for validating access to tasks.
>   */
> -static struct task_struct *lookup_task(const pid_t pid, bool thread,
> +static struct task_struct *lookup_task(const pid_t which_pid, bool thread,
>  				       bool gettime)
>  {
>  	struct task_struct *p;
> +	struct pid *pid;
>  
>  	/*
>  	 * If the encoded PID is 0, then the timer is targeted at current
>  	 * or the process to which current belongs.
>  	 */
> -	if (!pid)
> +	if (!which_pid)
>  		return thread ? current : current->group_leader;
>  
> -	p = find_task_by_vpid(pid);
> -	if (!p)
> -		return p;
> -
> -	if (thread)
> -		return same_thread_group(p, current) ? p : NULL;
> -
> -	if (gettime) {
> +	pid = find_vpid(which_pid);
> +	if (thread) {
> +		p = pid_task(pid, PIDTYPE_PID);
> +		if (p && !same_thread_group(p, current))
> +			p = NULL;
> +	} else {
>  		/*
>  		 * For clock_gettime(PROCESS) the task does not need to be
>  		 * the actual group leader. tsk->sighand gives
> @@ -76,13 +75,13 @@ static struct task_struct *lookup_task(const pid_t pid, bool thread,
>  		 * reference on it and store the task pointer until the
>  		 * timer is destroyed.

Btw, this comment is wrong since

     55e8c8eb2c7b ("posix-cpu-timers: Store a reference to a pid not a task")

Thanks,

        tglx
Eric W. Biederman April 27, 2020, 11:51 a.m. UTC | #13
Oleg Nesterov <oleg@redhat.com> writes:

> Eric,
>
> I am sick today and can't read the code, but I feel this patch is not
> right ... please correct me.


> So, iiuc when posix_cpu_timer_create() is called and CPUCLOCK_PERTHREAD
> is false we roughly have
>
> 	task = pid_task(pid, PIDTYPE_TGID);			// lookup_task()
>
> 	/* WINDOW */
>
> 	timer->it.cpu.pid = = get_task_pid(task, PIDTYPE_TGID)	// posix_cpu_timer_create()
>
> Now suppose that we race with mt-exec and this "task" is the old leader;
> it can be release_task()'ed in the WINDOW above and then get_task_pid()
> will return NULL.

Except it is asking for PIDTYPE_TGID.

task->signal even if it is freed (which it won't be in a mt-exec)
is valid until after an rcu window.

release_task()
   put_task_struct_rcu_user()
      call_rcu(..., delayed_put_task_struct())
... rcu delay ...
delayed_put_task_struct()
   put_task_struct()
      __put_task_struct()
         put_signal_struct()
            free_signal_struct()

Which means that task->signal->pids[PIDTYPE_TGID] will remain valid even
across mt-exec.

Further the only change I have introduced is to perform this work under
rcu_read_lock vs taking a reference to task_struct.  As the reference to
task_struct does not prevent release_task, the situation with respect
to races in the rest of the code does not change.

Hmm....

If the case instead is:
> 	timer->it.cpu.pid = get_task_pid(task, PIDTYPE_PID)	// posix_cpu_timer_create()

Which can also happen for threads in the same thread group.
I have to agree that we can wind up with a NULL pid.

And that is a brand new bug, because we didn't use to use pids.
Sigh.


> That is why I suggested to change lookup_task() to return "struct pid*"
> to eliminate the pid -> task -> pid transition.

Yes.  I have to agree.  Getting rid of the pid -> task -> pid transition
looks important to close bugs like that.

> Apart from the same_thread_group() check for the "thread" case we do not
> need task_struct at all, lookup_task() can do
>
> 	if (thread) {
> 		p = pid_task(pid, PIDTYPE_PID);
> 		if (p && !same_thread_group(p, current))
> 			pid = NULL;
> 	} else {
> 		... gettime check ...
>
> 		if (!pid_has_task(pid, PIDTYPE_TGID))
> 			pid = NULL;
> 	}
>
> 	return pid;
>
> No?

There is also the posix_cpu_clock_get, where we immediately use the
clock instead of create something we can use later.

I want to say the gettime case is another reason to go through the whole
transition but the code can just as easily say "pid = task_tgid(current)"
as it can "p = current";

Eric
Eric W. Biederman April 27, 2020, 11:53 a.m. UTC | #14
Thomas Gleixner <tglx@linutronix.de> writes:

> ebiederm@xmission.com (Eric W. Biederman) writes:
>
>> This allows the getref flag to be removed and the callers can
>> than take a task reference if needed.
>
> That changelog lacks any form of information why this should be
> changed. I can see the point vs. patch 2, but pretty please put coherent
> explanations into each patch.

Well excess flags bad.  But in this case I can do even better.

The code no longer takes a reference on task_struct so this patch
removes unnecessary code.

I will see if I can say that better.



>> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
>> ---
>>  kernel/time/posix-cpu-timers.c | 41 +++++++++++++++++-----------------
>>  1 file changed, 21 insertions(+), 20 deletions(-)
>>
>> diff --git a/kernel/time/posix-cpu-timers.c b/kernel/time/posix-cpu-timers.c
>> index 2fd3b3fa68bf..eba41c70f0f0 100644
>> --- a/kernel/time/posix-cpu-timers.c
>> +++ b/kernel/time/posix-cpu-timers.c
>> @@ -86,36 +86,34 @@ static struct task_struct *lookup_task(const pid_t pid, bool thread,
>>  }
>>  
>>  static struct task_struct *__get_task_for_clock(const clockid_t clock,
>> -						bool getref, bool gettime)
>> +						bool gettime)
>>  {
>>  	const bool thread = !!CPUCLOCK_PERTHREAD(clock);
>>  	const pid_t pid = CPUCLOCK_PID(clock);
>> -	struct task_struct *p;
>>  
>>  	if (CPUCLOCK_WHICH(clock) >= CPUCLOCK_MAX)
>>  		return NULL;
>>  
>> -	rcu_read_lock();
>> -	p = lookup_task(pid, thread, gettime);
>> -	if (p && getref)
>> -		get_task_struct(p);
>> -	rcu_read_unlock();
>> -	return p;
>> +	return lookup_task(pid, thread, gettime);
>>  }
>>  
>>  static inline struct task_struct *get_task_for_clock(const clockid_t clock)
>>  {
>> -	return __get_task_for_clock(clock, true, false);
>> +	return __get_task_for_clock(clock, false);
>>  }
>>  
>>  static inline struct task_struct *get_task_for_clock_get(const clockid_t clock)
>>  {
>> -	return __get_task_for_clock(clock, true, true);
>> +	return __get_task_for_clock(clock, true);
>>  }
>>  
>>  static inline int validate_clock_permissions(const clockid_t clock)
>>  {
>> -	return __get_task_for_clock(clock, false, false) ? 0 : -EINVAL;
>> +	int ret;
>
> New line between declarations and code please.
>
>> +	rcu_read_lock();
>> +	ret = __get_task_for_clock(clock, false) ? 0 : -EINVAL;
>> +	rcu_read_unlock();
>> +	return ret;
>>  }
>
> Thanks,
>
>         tglx
Eric W. Biederman April 27, 2020, 2:28 p.m. UTC | #15
Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Sun, Apr 26, 2020 at 7:14 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
>>
>> To support this add hlist_swap_before_rcu.  An hlist primitive that
>> will allow swaping the leading sections of two tasks.  For exchanging
>> the task pids it will just be swapping the hlist_heads of two single
>> entry lists.  But the functionality is more general.
>
> So I have no problems with the series now - the code is much more
> understandable. Partly because of the split-up, partly because of the
> comments, and partly because you explained the special case and why it
> was a valid thing to do...
>
> However, I did start thinking about this case again.
>
> I still don't think the "swap entry" macro is necessarily useful in
> _general_ - any time it's an actual individual entry, that swap macro
> doesn't really work.

But it isn't a "swap entry" macro/function.  I did not even attempt
to make it a "swap entry" function.

I made a chop two lists into two and swap the pieces function.

> So the only reason it works here is because you're actually swapping
> the whole list.
>
> But that, in turn, shouldn't be using that "first node" model at all,
> it should use the hlist_head. That would have made it a lot more
> obvious what is actually going on to me.
>
> Now, the comment very much talks about the head case, but the code
> still looks like it's swapping a non-head thing.
>
> I guess the code technically _works_ with "swap two list ends", but
> does that actually really make sense as an operation?

As an operation yes.  Will anyone else want that operation I don't know.

> So I no longer hate how this patch looks, but I wonder if we should
> just make the whole "this node is the *first* node" a bit more
> explicit in both the caller and in the swapping code.
>
> It could be as simple as replacing just the conceptual types and
> names, so instead of some "pnode1" double-indirect node pointer, we'd
> have
>
>         struct hlist_head *left_head = container_of(left->pprev,
> struct hlist_head, first);
>         struct hlist_head *right_head = container_of(right->pprev,
> struct hlist_head, first);
>
> and then the code would do
>
>         rcu_assign_pointer(right_head->first, left);
>         rcu_assign_pointer(left_head->first, right);
>         WRITE_ONCE(left->pprev,  &right_head->first);
>         WRITE_ONCE(right->pprev,  &left_head->first);
>
> which should generate the exact same code, but makes it clear that
> what we're doing is switching the whole hlist when given the first
> entries.
>
> Doesn't that make what it actually does a lot more understandable?

Understandable is a bit subjective. I think having a well defined hlist
operation I can call makes things more understandable.

I think the getting the list head as:
"head = &task->thread_pid->tasks[PIDTYPE_PID];" is more understandable
and less risky than container_of.

My concern and probably unreasonbable as this is a slow path
with getting the list heads after looking up the pid is that it seems
to add a wait for an additional cache line to load before anything can
happen.

The only way I really know to make this code much more understandable is
to remove the lists entirely for this case.  But that is a much larger
change and it is not clear that it makes the kernel code overall better.
I stared at that for a while and it is an interesting follow on but not
something I want or we even can do before exchange_tids is in place.

> The
> *pnode1/pnode2 games are somewhat opaque, but with that type and name
> change and using "container_of()", the code now fairly naturally reads
> as "oh, we're changing the first pointers in the list heads, and
> making the nodes point back to them" .
>
> Again - the current function _works_ with swapping two hlists in the
> middle (not two entries - it swaps the whole list starting at that
> entry!), so your current patch is in some ways "more generic". I'm
> just suggesting that the generic case doesn't make much sense, and
> that the "we know the first entries, swap the lists" actually is what
> the real use is, and writing it as such makes the code easier to
> understand.

Yep.  That is waht I designed it to do.  I sort of went the other
direction when writing this.  I could start with the list heads and swap
the rest of the lists and get the same code.  But it looked like it
would be a little slower to find the hlist_heads, and I couldn't think
of a good name for the function.  So I figured if I was writing a
fucntion for this case I would write one that was convinient.

For understandability that is my real challenge what is a good name
that people can read and understand what is happening for this swapping
function.

> But I'm not going to insist on this, so this is more an RFC. Maybe
> people disagree, and/or have an actual use case for that "break two
> hlists in the middle, swap the ends" that I find unlikely...
>
> (NOTE: My "convert to hlist_head" code _works_ for that case too
> because the code generation is the same! But it would be really really
> confusing for that code to be used for anything but the first entry).

Yes.

I am open to improvements.  Especially in the naming.

Would hlists_swap_heads_rcu be noticably better?

Eric
Eric W. Biederman April 27, 2020, 7:46 p.m. UTC | #16
Thomas Gleixner <tglx@linutronix.de> writes:

> ebiederm@xmission.com (Eric W. Biederman) writes:
>> Using pid_task(find_vpid(N), PIDTYPE_TGID) guarantees that if a task
>> is found it is at that moment the thread group leader.  Which removes
>> the need for the follow on test has_group_leader_pid.
>>
>> I have reorganized the rest of the code in lookup_task for clarity,
>> and created a common return for most of the code.
>
> Sorry, it's way harder to read than the very explicit exits which were
> there before.

My biggest gripe is the gettime and the ordinary !thread case should
be sharing code and they are not.  I know historically why they don't
but for all practical purposes has_group_leader_pid and
thread_group_leader are the same test.


>> The special case for clock_gettime with "pid == gettid" is not my
>> favorite.  I strongly suspect it isn't used as gettid is such a pain,
>> and passing 0 is much easier.  Still it is easier to keep this special
>> case than to do the reasarch that will show it isn't used.
>
> It might be not your favorite, but when I refactored the code I learned
> the hard way that one of the test suites has assumptions that
> clock_gettime(PROCESS) works from any task of a group and not just for
> the group leader. Sure we could fix the test suite, but test code tends
> to be copied ...

Do you know which test suite?
It would be nice to see such surprising code with my own eyes.

I completely agree that clock_gettime(PROCESS) should work for any task
of a group.  I would think anything with a constant like that would just
be passing in 0, which is trivial.  Looking up your threadid seems like
extra work.

Mostly my complaint is that the gettime subcase is an awkward special
case.  Added in 33ab0fec3352 ("posix-timers: Consolidate
posix_cpu_clock_get()") and the only justification for changing the
userspace ABI was that it made things less awkward to combine to
branches of code.

>>  /*
>>   * Functions for validating access to tasks.
>>   */
>> -static struct task_struct *lookup_task(const pid_t pid, bool thread,
>> +static struct task_struct *lookup_task(const pid_t which_pid, bool thread,
>>  				       bool gettime)
>>  {
>>  	struct task_struct *p;
>> +	struct pid *pid;
>>  
>>  	/*
>>  	 * If the encoded PID is 0, then the timer is targeted at current
>>  	 * or the process to which current belongs.
>>  	 */
>> -	if (!pid)
>> +	if (!which_pid)
>>  		return thread ? current : current->group_leader;
>>  
>> -	p = find_task_by_vpid(pid);
>> -	if (!p)
>> -		return p;
>> -
>> -	if (thread)
>> -		return same_thread_group(p, current) ? p : NULL;
>> -
>> -	if (gettime) {
>> +	pid = find_vpid(which_pid);
>> +	if (thread) {
>> +		p = pid_task(pid, PIDTYPE_PID);
>> +		if (p && !same_thread_group(p, current))
>> +			p = NULL;
>> +	} else {
>>  		/*
>>  		 * For clock_gettime(PROCESS) the task does not need to be
>>  		 * the actual group leader. tsk->sighand gives
>> @@ -76,13 +75,13 @@ static struct task_struct *lookup_task(const pid_t pid, bool thread,
>>  		 * reference on it and store the task pointer until the
>>  		 * timer is destroyed.
>
> Btw, this comment is wrong since
>
>      55e8c8eb2c7b ("posix-cpu-timers: Store a reference to a pid not a task")

It is definitely stale.  It continues to describe the motive for
limiting ourselves to a thread_group_leader.

I am cooking up a patch to tweak that comment, and get replace of
has_group_leader_pid.  Since it is unnecessary I just want to decouple
that work from this patchset.

Eric
Eric W. Biederman April 27, 2020, 8:23 p.m. UTC | #17
ebiederm@xmission.com (Eric W. Biederman) writes:
> Take a good look over the code base and see if I can see any more
> issues like was found in next_tgid and repost the core patches.

Oleg,

I am reading through kernel/ptrace.c and I am seeing a lot of:

	spin_lock(&child->sighand->siglock);

In places where I don't see anything guaranteeing the child is stopped
such as ptrace_freeze_traced.  Are all of those places safe or
do some of the need to be transformed into lock_task_sighand in
case the process is current running?

I might just be reading the code to quickly and missing what keeps the
code from executing exec and changing sighand.

Eric
Linus Torvalds April 27, 2020, 8:27 p.m. UTC | #18
On Mon, Apr 27, 2020 at 7:32 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> Would hlists_swap_heads_rcu be noticably better?

Edited out most of the rest because I think we're in agreement and
it's not a huge deal.

And yes, I think it might be nice to just call it "swap_heads()" and
make the naming match what the user wants, so that people who don't
care about the implementation can just guess from the function name.

But I also think that by now it's mostly bikeshedding, and I'm
probably also now biased by the fact that I have looked at that code
and read your explanations so it's all fresh in my mind.

A year from now, when I've forgotten the details, who knows which part
I'd find confusing? ;)

               Linus
Oleg Nesterov April 28, 2020, 6:03 p.m. UTC | #19
On 04/27, Eric W. Biederman wrote:
>
> Oleg Nesterov <oleg@redhat.com> writes:
>
> > Eric,
> >
> > I am sick today and can't read the code, but I feel this patch is not
> > right ... please correct me.
>
>
> > So, iiuc when posix_cpu_timer_create() is called and CPUCLOCK_PERTHREAD
> > is false we roughly have
> >
> > 	task = pid_task(pid, PIDTYPE_TGID);			// lookup_task()
> >
> > 	/* WINDOW */
> >
> > 	timer->it.cpu.pid = = get_task_pid(task, PIDTYPE_TGID)	// posix_cpu_timer_create()
> >
> > Now suppose that we race with mt-exec and this "task" is the old leader;
> > it can be release_task()'ed in the WINDOW above and then get_task_pid()
> > will return NULL.
>
> Except it is asking for PIDTYPE_TGID.
>
> task->signal

Ah yes, I knew I missed something...

Oleg.
diff mbox series

Patch

diff --git a/fs/exec.c b/fs/exec.c
index 06b4c550af5d..9b60f927afd7 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1186,11 +1186,8 @@  static int de_thread(struct task_struct *tsk)
 
 		/* Become a process group leader with the old leader's pid.
 		 * The old leader becomes a thread of the this thread group.
-		 * Note: The old leader also uses this pid until release_task
-		 *       is called.  Odd but simple and correct.
 		 */
-		tsk->pid = leader->pid;
-		change_pid(tsk, PIDTYPE_PID, task_pid(leader));
+		exchange_tids(tsk, leader);
 		transfer_pid(leader, tsk, PIDTYPE_TGID);
 		transfer_pid(leader, tsk, PIDTYPE_PGID);
 		transfer_pid(leader, tsk, PIDTYPE_SID);
diff --git a/include/linux/pid.h b/include/linux/pid.h
index cc896f0fc4e3..2159ffca63fc 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -102,6 +102,7 @@  extern void attach_pid(struct task_struct *task, enum pid_type);
 extern void detach_pid(struct task_struct *task, enum pid_type);
 extern void change_pid(struct task_struct *task, enum pid_type,
 			struct pid *pid);
+extern void exchange_tids(struct task_struct *task, struct task_struct *old);
 extern void transfer_pid(struct task_struct *old, struct task_struct *new,
 			 enum pid_type);
 
diff --git a/kernel/pid.c b/kernel/pid.c
index c835b844aca7..4ece32d8791a 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -363,6 +363,22 @@  void change_pid(struct task_struct *task, enum pid_type type,
 	attach_pid(task, type);
 }
 
+void exchange_tids(struct task_struct *ntask, struct task_struct *otask)
+{
+	/* pid_links[PIDTYPE_PID].next is always NULL */
+	struct pid *npid = READ_ONCE(ntask->thread_pid);
+	struct pid *opid = READ_ONCE(otask->thread_pid);
+
+	rcu_assign_pointer(opid->tasks[PIDTYPE_PID].first, &ntask->pid_links[PIDTYPE_PID]);
+	rcu_assign_pointer(npid->tasks[PIDTYPE_PID].first, &otask->pid_links[PIDTYPE_PID]);
+	rcu_assign_pointer(ntask->thread_pid, opid);
+	rcu_assign_pointer(otask->thread_pid, npid);
+	WRITE_ONCE(ntask->pid_links[PIDTYPE_PID].pprev, &opid->tasks[PIDTYPE_PID].first);
+	WRITE_ONCE(otask->pid_links[PIDTYPE_PID].pprev, &npid->tasks[PIDTYPE_PID].first);
+	WRITE_ONCE(ntask->pid, pid_nr(opid));
+	WRITE_ONCE(otask->pid, pid_nr(npid));
+}
+
 /* transfer_pid is an optimization of attach_pid(new), detach_pid(old) */
 void transfer_pid(struct task_struct *old, struct task_struct *new,
 			   enum pid_type type)