diff mbox series

[RFC,3/4] pidfd: improve uapi when task isn't found

Message ID 20250403-work-pidfd-fixes-v1-3-a123b6ed6716@kernel.org (mailing list archive)
State New
Headers show
Series pidfd: improve uapi when task isn't found | expand

Commit Message

Christian Brauner April 3, 2025, 2:09 p.m. UTC
We currently report EINVAL whenever a struct pid has no tasked attached
anymore thereby conflating two concepts:

(1) The task has already been reaped.
(2) The caller requested a pidfd for a thread-group leader but the pid
    actually references a struct pid that isn't used as a thread-group
    leader.

This is causing issues for non-threaded workloads as in [1].

This patch tries to allow userspace to distinguish between (1) and (2).
This is racy of course but that shouldn't matter.

Link: https://github.com/systemd/systemd/pull/36982 [1]
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 kernel/fork.c | 31 ++++++++++++++++++++++++++++---
 1 file changed, 28 insertions(+), 3 deletions(-)

Comments

Oleg Nesterov April 4, 2025, 12:37 p.m. UTC | #1
On 04/03, Christian Brauner wrote:
>
> We currently report EINVAL whenever a struct pid has no tasked attached
> anymore thereby conflating two concepts:
>
> (1) The task has already been reaped.
> (2) The caller requested a pidfd for a thread-group leader but the pid
>     actually references a struct pid that isn't used as a thread-group
>     leader.
>
> This is causing issues for non-threaded workloads as in [1].
>
> This patch tries to allow userspace to distinguish between (1) and (2).
> This is racy of course but that shouldn't matter.
>
> Link: https://github.com/systemd/systemd/pull/36982 [1]
> Signed-off-by: Christian Brauner <brauner@kernel.org>

For this series:

Reviewed-by: Oleg Nesterov <oleg@redhat.com>


But I have a couple of cosmetic nits...

>  int pidfd_prepare(struct pid *pid, unsigned int flags, struct file **ret)
>  {
> -	bool thread = flags & PIDFD_THREAD;
> +	int err = 0;
>
> -	if (!pid_has_task(pid, thread ? PIDTYPE_PID : PIDTYPE_TGID))
> -		return -EINVAL;
> +	if (!(flags & PIDFD_THREAD)) {
> +		/*
> +		 * If this is struct pid isn't used as a thread-group
> +		 * leader pid but the caller requested to create a
> +		 * thread-group leader pidfd then report ENOENT to the
> +		 * caller as a hint.
> +		 */
> +		if (!pid_has_task(pid, PIDTYPE_TGID))
> +			err = -ENOENT;
> +	}
> +
> +	/*
> +	 * If this wasn't a thread-group leader struct pid or the task
> +	 * got reaped in the meantime report -ESRCH to userspace.
> +	 *
> +	 * This is racy of course. This could've not been a thread-group
> +	 * leader struct pid and we set ENOENT above but in the meantime
> +	 * the task got reaped. Or there was a multi-threaded-exec by a
> +	 * subthread and we were a thread-group leader but now got
> +	 * killed.

The comment about the multi-threaded-exec looks a bit misleading to me.
If this pid is a group-leader-pid and we race with de_thread() which does

		exchange_tids(tsk, leader);
		transfer_pid(leader, tsk, PIDTYPE_TGID);

nothing "bad" can happen, both pid_has_task(PIDTYPE_PID) or
pid_has_task(PIDTYPE_TGID) can't return NULL during (or after) this
transition.

hlists_swap_heads_rcu() or hlist_replace_rcu() can't make
hlist_head->first == NULL during this transition...

Or I misunderstood the comment?

And... the code looks a bit overcomplicated to me, why not simply

	int pidfd_prepare(struct pid *pid, unsigned int flags, struct file **ret)
	{
		if (!pid_has_task(pid, PIDTYPE_PID))
			return -ESRCH;

		if (!(flags & PIDFD_THREAD) && !pid_has_task(pid, PIDTYPE_TGID))
			return -ENOENT;

		return __pidfd_prepare(pid, flags, ret);
	}

? Of course, the comments should stay.

But again, this is cosmetic/subjective, please do what you like more.

Oleg.
Christian Brauner April 4, 2025, 1:38 p.m. UTC | #2
On Fri, Apr 04, 2025 at 02:37:38PM +0200, Oleg Nesterov wrote:
> On 04/03, Christian Brauner wrote:
> >
> > We currently report EINVAL whenever a struct pid has no tasked attached
> > anymore thereby conflating two concepts:
> >
> > (1) The task has already been reaped.
> > (2) The caller requested a pidfd for a thread-group leader but the pid
> >     actually references a struct pid that isn't used as a thread-group
> >     leader.
> >
> > This is causing issues for non-threaded workloads as in [1].
> >
> > This patch tries to allow userspace to distinguish between (1) and (2).
> > This is racy of course but that shouldn't matter.
> >
> > Link: https://github.com/systemd/systemd/pull/36982 [1]
> > Signed-off-by: Christian Brauner <brauner@kernel.org>
> 
> For this series:
> 
> Reviewed-by: Oleg Nesterov <oleg@redhat.com>
> 
> 
> But I have a couple of cosmetic nits...
> 
> >  int pidfd_prepare(struct pid *pid, unsigned int flags, struct file **ret)
> >  {
> > -	bool thread = flags & PIDFD_THREAD;
> > +	int err = 0;
> >
> > -	if (!pid_has_task(pid, thread ? PIDTYPE_PID : PIDTYPE_TGID))
> > -		return -EINVAL;
> > +	if (!(flags & PIDFD_THREAD)) {
> > +		/*
> > +		 * If this is struct pid isn't used as a thread-group
> > +		 * leader pid but the caller requested to create a
> > +		 * thread-group leader pidfd then report ENOENT to the
> > +		 * caller as a hint.
> > +		 */
> > +		if (!pid_has_task(pid, PIDTYPE_TGID))
> > +			err = -ENOENT;
> > +	}
> > +
> > +	/*
> > +	 * If this wasn't a thread-group leader struct pid or the task
> > +	 * got reaped in the meantime report -ESRCH to userspace.
> > +	 *
> > +	 * This is racy of course. This could've not been a thread-group
> > +	 * leader struct pid and we set ENOENT above but in the meantime
> > +	 * the task got reaped. Or there was a multi-threaded-exec by a
> > +	 * subthread and we were a thread-group leader but now got
> > +	 * killed.
> 
> The comment about the multi-threaded-exec looks a bit misleading to me.
> If this pid is a group-leader-pid and we race with de_thread() which does
> 
> 		exchange_tids(tsk, leader);
> 		transfer_pid(leader, tsk, PIDTYPE_TGID);
> 
> nothing "bad" can happen, both pid_has_task(PIDTYPE_PID) or
> pid_has_task(PIDTYPE_TGID) can't return NULL during (or after) this
> transition.
> 
> hlists_swap_heads_rcu() or hlist_replace_rcu() can't make
> hlist_head->first == NULL during this transition...

Good point.

> 
> Or I misunderstood the comment?
> 
> And... the code looks a bit overcomplicated to me, why not simply
> 
> 	int pidfd_prepare(struct pid *pid, unsigned int flags, struct file **ret)
> 	{
> 		if (!pid_has_task(pid, PIDTYPE_PID))
> 			return -ESRCH;
> 
> 		if (!(flags & PIDFD_THREAD) && !pid_has_task(pid, PIDTYPE_TGID))
> 			return -ENOENT;

I thought that checking PIDTYPE_PID first could cause misleading results
where we report ENOENT where we should report ESRCH: If the task was
released after the successful PIDTYPE_PID check for a pid that was never
a thread-group leader we report ENOENT. That's what I reversed the
check. But I can adapt that to you scheme. I mostly wanted a place to
put the comments.

> 
> 		return __pidfd_prepare(pid, flags, ret);
> 	}
> 
> ? Of course, the comments should stay.
> 
> But again, this is cosmetic/subjective, please do what you like more.
> 
> Oleg.
>
Oleg Nesterov April 4, 2025, 2:53 p.m. UTC | #3
On 04/04, Christian Brauner wrote:
>
> On Fri, Apr 04, 2025 at 02:37:38PM +0200, Oleg Nesterov wrote:
> > And... the code looks a bit overcomplicated to me, why not simply
> >
> > 	int pidfd_prepare(struct pid *pid, unsigned int flags, struct file **ret)
> > 	{
> > 		if (!pid_has_task(pid, PIDTYPE_PID))
> > 			return -ESRCH;
> >
> > 		if (!(flags & PIDFD_THREAD) && !pid_has_task(pid, PIDTYPE_TGID))
> > 			return -ENOENT;
>
> I thought that checking PIDTYPE_PID first could cause misleading results
> where we report ENOENT where we should report ESRCH: If the task was
> released after the successful PIDTYPE_PID check for a pid that was never
> a thread-group leader we report ENOENT.

Hmm... but the code above can only return ENOENT if !(flags & PIDFD_THREAD),
so in this case -ENOENT is correct?

I guess -ENOENT would be wrong if this pid _was_ a leader pid and we
race with __unhash_process() which does

	detach_pid(post->pids, p, PIDTYPE_PID);
	if (group_dead)
		detach_pid(post->pids, p, PIDTYPE_TGID);

but without tasklist_lock (or additional barries in both pidfd_prepare() and
__unhash_process() pidfd_prepare() can see the result of these 2 detach_pid()'s
in any order anyway. So I don't think the code above is "more" racy.

Although perhaps we can rely on the fact the the 1st detach_pid(PIDTYPE_PID)
does wake_up(pid->wait_pidfd) and use pid->wait_pidfd->lock to avoid the
races, not sure...

But,

> But I can adapt that to you scheme.

Again, up to you, whatever you prefer.

Oleg.
diff mbox series

Patch

diff --git a/kernel/fork.c b/kernel/fork.c
index 182ec2e9087d..0fe54fcd11b3 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2108,10 +2108,35 @@  static int __pidfd_prepare(struct pid *pid, unsigned int flags, struct file **re
  */
 int pidfd_prepare(struct pid *pid, unsigned int flags, struct file **ret)
 {
-	bool thread = flags & PIDFD_THREAD;
+	int err = 0;
 
-	if (!pid_has_task(pid, thread ? PIDTYPE_PID : PIDTYPE_TGID))
-		return -EINVAL;
+	if (!(flags & PIDFD_THREAD)) {
+		/*
+		 * If this is struct pid isn't used as a thread-group
+		 * leader pid but the caller requested to create a
+		 * thread-group leader pidfd then report ENOENT to the
+		 * caller as a hint.
+		 */
+		if (!pid_has_task(pid, PIDTYPE_TGID))
+			err = -ENOENT;
+	}
+
+	/*
+	 * If this wasn't a thread-group leader struct pid or the task
+	 * got reaped in the meantime report -ESRCH to userspace.
+	 *
+	 * This is racy of course. This could've not been a thread-group
+	 * leader struct pid and we set ENOENT above but in the meantime
+	 * the task got reaped. Or there was a multi-threaded-exec by a
+	 * subthread and we were a thread-group leader but now got
+	 * killed. All of that doesn't matter since the task has already
+	 * been reaped that distinction is meaningless to userspace so
+	 * just report ESRCH.
+	 */
+	if (!pid_has_task(pid, PIDTYPE_PID))
+		err = -ESRCH;
+	if (err)
+		return err;
 
 	return __pidfd_prepare(pid, flags, ret);
 }