PIDFD_THREAD behavior for thread-group leaders

Message ID	nhoaiykqnoid3df3ckmqqgycbjqtd2rutrpeat25j4bbm7tbjl@tpncnt7cp26n (mailing list archive)
State	New
Headers	show Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 4FB8720F066 for <linux-fsdevel@vger.kernel.org>; Thu, 6 Mar 2025 11:41:30 +0000 (UTC) Date: Thu, 6 Mar 2025 12:41:26 +0100 From: Christian Brauner <brauner@kernel.org> To: Oleg Nesterov <oleg@redhat.com> Cc: linux-fsdevel@vger.kernel.org, Jeff Layton <jlayton@kernel.org>, Lennart Poettering <lennart@poettering.net>, Daan De Meyer <daan.j.demeyer@gmail.com>, Mike Yuan <me@yhndnzj.com> Subject: PIDFD_THREAD behavior for thread-group leaders Message-ID: <nhoaiykqnoid3df3ckmqqgycbjqtd2rutrpeat25j4bbm7tbjl@tpncnt7cp26n> Precedence: bulk MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline
Series	PIDFD_THREAD behavior for thread-group leaders \| expand PIDFD_THREAD behavior for thread-group leaders

Message ID

nhoaiykqnoid3df3ckmqqgycbjqtd2rutrpeat25j4bbm7tbjl@tpncnt7cp26n (mailing list archive)

State

New

Headers

Date: Thu, 6 Mar 2025 12:41:26 +0100
From: Christian Brauner <brauner@kernel.org>
To: Oleg Nesterov <oleg@redhat.com>
Cc: linux-fsdevel@vger.kernel.org, Jeff Layton <jlayton@kernel.org>,
	Lennart Poettering <lennart@poettering.net>,
 Daan De Meyer <daan.j.demeyer@gmail.com>,
	Mike Yuan <me@yhndnzj.com>
Subject: PIDFD_THREAD behavior for thread-group leaders
Message-ID: <nhoaiykqnoid3df3ckmqqgycbjqtd2rutrpeat25j4bbm7tbjl@tpncnt7cp26n>
Precedence: bulk
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline

Series

PIDFD_THREAD behavior for thread-group leaders | expand

Commit Message

Christian Brauner March 6, 2025, 11:41 a.m. UTC

Oleg,

I've been thinking about the multi-threaded exec case where a
non-thread-group leader task execs and assumes the thread-group leaders
struct pid.

Back when we implemented support for PIDFD_THREAD we ended up with the
decision that if userspace holds:

pidfd_leader_thread = pidfd_open(<thread-group-leader-pid>, PIDFD_THREAD)

that exit notification is not strictly defined if a non-thread-group
leader thread execs: If poll is called before the exec happened, then an
exit notification may be observed and if aftewards no exit notification
is generated for the old thread-group leader. Of if exit for the old
thread-group leader was observed but poll is called again then it would
block again.

I was wondering why the following snippet wouldn't work to ensure that
PIDFD_THREAD pidfds for thread-group leaders wouldn't be woken with
spurious exits:


Because that would seem more consistent to me. The downside would be
that if userspace performed a series of multi-threaded exec for
non-thread-group leader threads then waiters wouldn't get woken. But I
think that's probably ok.

To handle this case we could later think about whether we can instead
start generating a separate poll (POLLPRI?) event when exec happens.

I'm probably missing something very obvious why that won't work.

Christian

Comments

Oleg Nesterov March 6, 2025, 12:17 p.m. UTC | #1

On 03/06, Christian Brauner wrote:
>
> Back when we implemented support for PIDFD_THREAD we ended up with the
> decision that if userspace holds:
>
> pidfd_leader_thread = pidfd_open(<thread-group-leader-pid>, PIDFD_THREAD)
>
> that exit notification is not strictly defined if a non-thread-group
> leader thread execs:

Yes, this was even documented in commit 64bef697d33b ...

> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -745,8 +745,11 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
>         /*
>          * sub-thread or delay_group_leader(), wake up the
>          * PIDFD_THREAD waiters.
> +        *
> +        * The thread-group leader will be taken over by the execing
> +        * task so don't cause spurious wakeups.
>          */
> -       if (!thread_group_empty(tsk))
> +       if (!thread_group_empty(tsk) && (tsk->signal->notify_count >= 0))
>                 do_notify_pidfd(tsk);
>
>         if (unlikely(tsk->ptrace)) {

perhaps... but this won't help if the leader exits and that another
thread does exec?

From the changelog

	Perhaps we can improve this behaviour later, pidfd_poll()
	can probably take sig->group_exec_task into account. But
	this doesn't really differ from the case when the leader
	exits before other threads (so pidfd_poll() succeeds) and
	then another thread execs and pidfd_poll() will block again.

so I am not sure what can we do.

I'll try to think more later, but I can't promise much :(

Oleg.

Christian Brauner March 6, 2025, 12:48 p.m. UTC | #2

On Thu, Mar 06, 2025 at 01:17:13PM +0100, Oleg Nesterov wrote:
> On 03/06, Christian Brauner wrote:
> >
> > Back when we implemented support for PIDFD_THREAD we ended up with the
> > decision that if userspace holds:
> >
> > pidfd_leader_thread = pidfd_open(<thread-group-leader-pid>, PIDFD_THREAD)
> >
> > that exit notification is not strictly defined if a non-thread-group
> > leader thread execs:
> 
> Yes, this was even documented in commit 64bef697d33b ...

Yeah, I'm aware I was just revisiting this decision.

> 
> > --- a/kernel/exit.c
> > +++ b/kernel/exit.c
> > @@ -745,8 +745,11 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
> >         /*
> >          * sub-thread or delay_group_leader(), wake up the
> >          * PIDFD_THREAD waiters.
> > +        *
> > +        * The thread-group leader will be taken over by the execing
> > +        * task so don't cause spurious wakeups.
> >          */
> > -       if (!thread_group_empty(tsk))
> > +       if (!thread_group_empty(tsk) && (tsk->signal->notify_count >= 0))
> >                 do_notify_pidfd(tsk);
> >
> >         if (unlikely(tsk->ptrace)) {
> 
> perhaps... but this won't help if the leader exits and that another
> thread does exec?
> 
> >From the changelog
> 
> 	Perhaps we can improve this behaviour later, pidfd_poll()
> 	can probably take sig->group_exec_task into account. But
> 	this doesn't really differ from the case when the leader
> 	exits before other threads (so pidfd_poll() succeeds) and
> 	then another thread execs and pidfd_poll() will block again.
> 
> so I am not sure what can we do.

I think early thread-group leader exit is a bug in the userspace
program whereas multi-threaded exec is well-defined (if ugly).

To detect early-thread-group leader exec userspace could do:

pidfd_leader_thread = pidfd_open(<thread-group-leader-pid>, 0)
pidfd_thread = pidfd_open(<thread-group-leader-pid>, PIDFD_THREAD)

and then they need to add both file descriptors to the poll instance.

If proper multi-threaded exec happens no notification will be generated
on either file descriptor.

However, if the thread-group leader exits prematurely then a
notification will be generated on pidfd_thread allowing detection of
malformed behavior.

So the premature thread-group leader exec thing is probably something we
can ignore as a real use-case.

> 
> I'll try to think more later, but I can't promise much :(
> 
> Oleg.
>

diff --git a/kernel/exit.c b/kernel/exit.c
index 9916305e34d3..b79ded1b3bf5 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -745,8 +745,11 @@  static void exit_notify(struct task_struct *tsk, int group_dead)
        /*
         * sub-thread or delay_group_leader(), wake up the
         * PIDFD_THREAD waiters.
+        *
+        * The thread-group leader will be taken over by the execing
+        * task so don't cause spurious wakeups.
         */
-       if (!thread_group_empty(tsk))
+       if (!thread_group_empty(tsk) && (tsk->signal->notify_count >= 0))
                do_notify_pidfd(tsk);

        if (unlikely(tsk->ptrace)) {

PIDFD_THREAD behavior for thread-group leaders

Commit Message

Comments

Patch