Message ID | 20250403-work-pidfd-fixes-v1-3-a123b6ed6716@kernel.org (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | pidfd: improve uapi when task isn't found | expand |
On 04/03, Christian Brauner wrote: > > We currently report EINVAL whenever a struct pid has no tasked attached > anymore thereby conflating two concepts: > > (1) The task has already been reaped. > (2) The caller requested a pidfd for a thread-group leader but the pid > actually references a struct pid that isn't used as a thread-group > leader. > > This is causing issues for non-threaded workloads as in [1]. > > This patch tries to allow userspace to distinguish between (1) and (2). > This is racy of course but that shouldn't matter. > > Link: https://github.com/systemd/systemd/pull/36982 [1] > Signed-off-by: Christian Brauner <brauner@kernel.org> For this series: Reviewed-by: Oleg Nesterov <oleg@redhat.com> But I have a couple of cosmetic nits... > int pidfd_prepare(struct pid *pid, unsigned int flags, struct file **ret) > { > - bool thread = flags & PIDFD_THREAD; > + int err = 0; > > - if (!pid_has_task(pid, thread ? PIDTYPE_PID : PIDTYPE_TGID)) > - return -EINVAL; > + if (!(flags & PIDFD_THREAD)) { > + /* > + * If this is struct pid isn't used as a thread-group > + * leader pid but the caller requested to create a > + * thread-group leader pidfd then report ENOENT to the > + * caller as a hint. > + */ > + if (!pid_has_task(pid, PIDTYPE_TGID)) > + err = -ENOENT; > + } > + > + /* > + * If this wasn't a thread-group leader struct pid or the task > + * got reaped in the meantime report -ESRCH to userspace. > + * > + * This is racy of course. This could've not been a thread-group > + * leader struct pid and we set ENOENT above but in the meantime > + * the task got reaped. Or there was a multi-threaded-exec by a > + * subthread and we were a thread-group leader but now got > + * killed. The comment about the multi-threaded-exec looks a bit misleading to me. If this pid is a group-leader-pid and we race with de_thread() which does exchange_tids(tsk, leader); transfer_pid(leader, tsk, PIDTYPE_TGID); nothing "bad" can happen, both pid_has_task(PIDTYPE_PID) or pid_has_task(PIDTYPE_TGID) can't return NULL during (or after) this transition. hlists_swap_heads_rcu() or hlist_replace_rcu() can't make hlist_head->first == NULL during this transition... Or I misunderstood the comment? And... the code looks a bit overcomplicated to me, why not simply int pidfd_prepare(struct pid *pid, unsigned int flags, struct file **ret) { if (!pid_has_task(pid, PIDTYPE_PID)) return -ESRCH; if (!(flags & PIDFD_THREAD) && !pid_has_task(pid, PIDTYPE_TGID)) return -ENOENT; return __pidfd_prepare(pid, flags, ret); } ? Of course, the comments should stay. But again, this is cosmetic/subjective, please do what you like more. Oleg.
On Fri, Apr 04, 2025 at 02:37:38PM +0200, Oleg Nesterov wrote: > On 04/03, Christian Brauner wrote: > > > > We currently report EINVAL whenever a struct pid has no tasked attached > > anymore thereby conflating two concepts: > > > > (1) The task has already been reaped. > > (2) The caller requested a pidfd for a thread-group leader but the pid > > actually references a struct pid that isn't used as a thread-group > > leader. > > > > This is causing issues for non-threaded workloads as in [1]. > > > > This patch tries to allow userspace to distinguish between (1) and (2). > > This is racy of course but that shouldn't matter. > > > > Link: https://github.com/systemd/systemd/pull/36982 [1] > > Signed-off-by: Christian Brauner <brauner@kernel.org> > > For this series: > > Reviewed-by: Oleg Nesterov <oleg@redhat.com> > > > But I have a couple of cosmetic nits... > > > int pidfd_prepare(struct pid *pid, unsigned int flags, struct file **ret) > > { > > - bool thread = flags & PIDFD_THREAD; > > + int err = 0; > > > > - if (!pid_has_task(pid, thread ? PIDTYPE_PID : PIDTYPE_TGID)) > > - return -EINVAL; > > + if (!(flags & PIDFD_THREAD)) { > > + /* > > + * If this is struct pid isn't used as a thread-group > > + * leader pid but the caller requested to create a > > + * thread-group leader pidfd then report ENOENT to the > > + * caller as a hint. > > + */ > > + if (!pid_has_task(pid, PIDTYPE_TGID)) > > + err = -ENOENT; > > + } > > + > > + /* > > + * If this wasn't a thread-group leader struct pid or the task > > + * got reaped in the meantime report -ESRCH to userspace. > > + * > > + * This is racy of course. This could've not been a thread-group > > + * leader struct pid and we set ENOENT above but in the meantime > > + * the task got reaped. Or there was a multi-threaded-exec by a > > + * subthread and we were a thread-group leader but now got > > + * killed. > > The comment about the multi-threaded-exec looks a bit misleading to me. > If this pid is a group-leader-pid and we race with de_thread() which does > > exchange_tids(tsk, leader); > transfer_pid(leader, tsk, PIDTYPE_TGID); > > nothing "bad" can happen, both pid_has_task(PIDTYPE_PID) or > pid_has_task(PIDTYPE_TGID) can't return NULL during (or after) this > transition. > > hlists_swap_heads_rcu() or hlist_replace_rcu() can't make > hlist_head->first == NULL during this transition... Good point. > > Or I misunderstood the comment? > > And... the code looks a bit overcomplicated to me, why not simply > > int pidfd_prepare(struct pid *pid, unsigned int flags, struct file **ret) > { > if (!pid_has_task(pid, PIDTYPE_PID)) > return -ESRCH; > > if (!(flags & PIDFD_THREAD) && !pid_has_task(pid, PIDTYPE_TGID)) > return -ENOENT; I thought that checking PIDTYPE_PID first could cause misleading results where we report ENOENT where we should report ESRCH: If the task was released after the successful PIDTYPE_PID check for a pid that was never a thread-group leader we report ENOENT. That's what I reversed the check. But I can adapt that to you scheme. I mostly wanted a place to put the comments. > > return __pidfd_prepare(pid, flags, ret); > } > > ? Of course, the comments should stay. > > But again, this is cosmetic/subjective, please do what you like more. > > Oleg. >
On 04/04, Christian Brauner wrote: > > On Fri, Apr 04, 2025 at 02:37:38PM +0200, Oleg Nesterov wrote: > > And... the code looks a bit overcomplicated to me, why not simply > > > > int pidfd_prepare(struct pid *pid, unsigned int flags, struct file **ret) > > { > > if (!pid_has_task(pid, PIDTYPE_PID)) > > return -ESRCH; > > > > if (!(flags & PIDFD_THREAD) && !pid_has_task(pid, PIDTYPE_TGID)) > > return -ENOENT; > > I thought that checking PIDTYPE_PID first could cause misleading results > where we report ENOENT where we should report ESRCH: If the task was > released after the successful PIDTYPE_PID check for a pid that was never > a thread-group leader we report ENOENT. Hmm... but the code above can only return ENOENT if !(flags & PIDFD_THREAD), so in this case -ENOENT is correct? I guess -ENOENT would be wrong if this pid _was_ a leader pid and we race with __unhash_process() which does detach_pid(post->pids, p, PIDTYPE_PID); if (group_dead) detach_pid(post->pids, p, PIDTYPE_TGID); but without tasklist_lock (or additional barries in both pidfd_prepare() and __unhash_process() pidfd_prepare() can see the result of these 2 detach_pid()'s in any order anyway. So I don't think the code above is "more" racy. Although perhaps we can rely on the fact the the 1st detach_pid(PIDTYPE_PID) does wake_up(pid->wait_pidfd) and use pid->wait_pidfd->lock to avoid the races, not sure... But, > But I can adapt that to you scheme. Again, up to you, whatever you prefer. Oleg.
diff --git a/kernel/fork.c b/kernel/fork.c index 182ec2e9087d..0fe54fcd11b3 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -2108,10 +2108,35 @@ static int __pidfd_prepare(struct pid *pid, unsigned int flags, struct file **re */ int pidfd_prepare(struct pid *pid, unsigned int flags, struct file **ret) { - bool thread = flags & PIDFD_THREAD; + int err = 0; - if (!pid_has_task(pid, thread ? PIDTYPE_PID : PIDTYPE_TGID)) - return -EINVAL; + if (!(flags & PIDFD_THREAD)) { + /* + * If this is struct pid isn't used as a thread-group + * leader pid but the caller requested to create a + * thread-group leader pidfd then report ENOENT to the + * caller as a hint. + */ + if (!pid_has_task(pid, PIDTYPE_TGID)) + err = -ENOENT; + } + + /* + * If this wasn't a thread-group leader struct pid or the task + * got reaped in the meantime report -ESRCH to userspace. + * + * This is racy of course. This could've not been a thread-group + * leader struct pid and we set ENOENT above but in the meantime + * the task got reaped. Or there was a multi-threaded-exec by a + * subthread and we were a thread-group leader but now got + * killed. All of that doesn't matter since the task has already + * been reaped that distinction is meaningless to userspace so + * just report ESRCH. + */ + if (!pid_has_task(pid, PIDTYPE_PID)) + err = -ESRCH; + if (err) + return err; return __pidfd_prepare(pid, flags, ret); }
We currently report EINVAL whenever a struct pid has no tasked attached anymore thereby conflating two concepts: (1) The task has already been reaped. (2) The caller requested a pidfd for a thread-group leader but the pid actually references a struct pid that isn't used as a thread-group leader. This is causing issues for non-threaded workloads as in [1]. This patch tries to allow userspace to distinguish between (1) and (2). This is racy of course but that shouldn't matter. Link: https://github.com/systemd/systemd/pull/36982 [1] Signed-off-by: Christian Brauner <brauner@kernel.org> --- kernel/fork.c | 31 ++++++++++++++++++++++++++++--- 1 file changed, 28 insertions(+), 3 deletions(-)