[v2,2/4] exit: support non-blocking pidfds

Message ID 20200902102130.147672-3-christian.brauner@ubuntu.com
State New
Series Support non-blocking pidfds

Commit Message

Christian Brauner Sept. 2, 2020, 10:21 a.m. UTC
Passing a non-blocking pidfd to waitid() currently has no effect, i.e. it is
not supported. There are users who would like to use waitid() on pidfds that
are O_NONBLOCK, mix them with pidfds that are blocking, and pass both kinds to
waitid().
The expected behavior is to have waitid() return -EAGAIN for non-blocking
pidfds and to block for blocking pidfds without needing to perform any
additional checks for flags set on the pidfd before passing it to waitid().
Non-blocking pidfds will return EAGAIN from waitid() when no child process is
ready yet. Returning -EAGAIN for non-blocking pidfds makes it easier for event
loops that handle EAGAIN specially.

It also makes the API more consistent and uniform. In essence, waitid() is
treated like a read on a non-blocking pidfd or a recvmsg() on a non-blocking
socket.
With the addition of support for non-blocking pidfds we support the same
functionality that sockets do. For sockets, recvmsg() supports MSG_DONTWAIT;
for pidfds, waitid() supports WNOHANG. Both flags are per-call options. In
contrast, non-blocking pidfds and non-blocking sockets are a setting on an open
file description affecting all threads in the calling process as well as other
processes that hold file descriptors referring to the same open file
description. Both behaviors, per call and per open file description, have
genuine use-cases.

The implementation should be straightforward: we simply raise the WNOHANG flag
when a non-blocking pidfd is passed, and when do_wait() returns without finding
an eligible task and the pidfd is non-blocking, we set EAGAIN. If no child
process exists, non-blocking pidfd users will continue to see ECHILD, but if
child processes exist and have not yet exited, users will see EAGAIN.
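
From the caller's side, the three outcomes described above (child reaped,
EAGAIN, ECHILD) could be told apart roughly as follows. This is only a sketch
under assumptions: classify_wait(), wait_pidfd_once(), the enum and
SKETCH_P_PIDFD are hypothetical names, the raw waitid syscall is used so the
example does not depend on libc exposing P_PIDFD, and WNOHANG is deliberately
not passed, so a return value of 0 always means a child was reaped.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

/* Assumption: P_PIDFD is 3 in the kernel UAPI. A local constant is used
 * because older libc headers may not define P_PIDFD. */
#define SKETCH_P_PIDFD 3

enum wait_outcome {
	WAIT_REAPED,    /* waitid() returned 0: info describes the child */
	WAIT_NOT_READY, /* EAGAIN: child exists but has not exited yet */
	WAIT_NO_CHILD,  /* ECHILD: there is nothing to wait for */
	WAIT_FAILED,    /* any other error */
};

/* Map a waitid() return value and errno to one of the outcomes above. */
static enum wait_outcome classify_wait(long rc, int err)
{
	if (rc == 0)
		return WAIT_REAPED;
	if (err == EAGAIN)
		return WAIT_NOT_READY;
	if (err == ECHILD)
		return WAIT_NO_CHILD;
	return WAIT_FAILED;
}

/* Wait once on a pidfd without WNOHANG. Whether the call blocks or fails
 * with EAGAIN is then decided by O_NONBLOCK on the pidfd itself. */
static enum wait_outcome wait_pidfd_once(int pidfd, siginfo_t *info)
{
	long rc = syscall(SYS_waitid, SKETCH_P_PIDFD, pidfd, info, WEXITED, NULL);

	return classify_wait(rc, errno);
}
```

An event loop would typically re-arm its readiness notification on
WAIT_NOT_READY and call wait_pidfd_once() again once the pidfd is reported
readable.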

A concrete use-case that was brought up on-list was Josh's async pidfd library.
Ever since the introduction of pidfds and more advanced async io, various
programming languages such as Rust have grown support for async event
libraries. These libraries are created to help build epoll-based event loops
around file descriptors. A common pattern is to automatically set O_NONBLOCK on
all file descriptors they manage.

For such libraries the EAGAIN error code is treated specially. When a function
is called that returns EAGAIN, the function isn't called again until the event
loop indicates that the file descriptor is ready. Supporting EAGAIN when
waiting on pidfds makes such libraries just work with little effort.

Link: https://lore.kernel.org/lkml/20200811181236.GA18763@localhost/
Link: https://github.com/joshtriplett/async-pidfd
Cc: Kees Cook <keescook@chromium.org>
Cc: Sargun Dhillon <sargun@sargun.me>
Cc: Jann Horn <jannh@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Suggested-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
/* v2 */
- Oleg Nesterov <oleg@redhat.com>:
  - Remove the eagain_error and simply set to EAGAIN in kernel_waitid() if
    pidfd is non-blocking and no child process has yet exited.
---
 kernel/exit.c | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

Comments

Oleg Nesterov Sept. 3, 2020, 2:22 p.m. UTC | #1
On 09/02, Christian Brauner wrote:
>
> It also makes the API more consistent and uniform. In essence, waitid() is
> treated like a read on a non-blocking pidfd or a recvmsg() on a non-blocking
> socket.
> With the addition of support for non-blocking pidfds we support the same
> functionality that sockets do. For sockets() recvmsg() supports MSG_DONTWAIT
> for pidfds waitid() supports WNOHANG.

What I personally do not like is that waitid(WNOHANG) returns zero or EAGAIN
depending on f_flags & O_NONBLOCK... This doesn't match recvmsg(MSG_DONTWAIT)
and doesn't look consistent to me.

Never mind, the patch looks correct and if you think this can really help
user-space I won't argue.

Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Christian Brauner Sept. 3, 2020, 3:38 p.m. UTC | #2
On Thu, Sep 03, 2020 at 04:22:42PM +0200, Oleg Nesterov wrote:
> On 09/02, Christian Brauner wrote:
> >
> > It also makes the API more consistent and uniform. In essence, waitid() is
> > treated like a read on a non-blocking pidfd or a recvmsg() on a non-blocking
> > socket.
> > With the addition of support for non-blocking pidfds we support the same
> > functionality that sockets do. For sockets() recvmsg() supports MSG_DONTWAIT
> > for pidfds waitid() supports WNOHANG.
> 
> What I personally do not like is that waitid(WNOHANG) returns zero or EAGAIN
> depending on f_flags & O_NONBLOCK... This doesn't match recvmsg(MSG_DONTWAIT)
> and doesn't look consistent to me.

It's not my favorite thing either but async event loops are usually
modeled around EAGAIN so I think this has benefits. Josh can speak more
to that.

Christian
Josh Triplett Sept. 3, 2020, 11:54 p.m. UTC | #3
On Thu, Sep 03, 2020 at 05:38:47PM +0200, Christian Brauner wrote:
> On Thu, Sep 03, 2020 at 04:22:42PM +0200, Oleg Nesterov wrote:
> > On 09/02, Christian Brauner wrote:
> > >
> > > It also makes the API more consistent and uniform. In essence, waitid() is
> > > treated like a read on a non-blocking pidfd or a recvmsg() on a non-blocking
> > > socket.
> > > With the addition of support for non-blocking pidfds we support the same
> > > functionality that sockets do. For sockets() recvmsg() supports MSG_DONTWAIT
> > > for pidfds waitid() supports WNOHANG.
> > 
> > What I personally do not like is that waitid(WNOHANG) returns zero or EAGAIN
> > depending on f_flags & O_NONBLOCK... This doesn't match recvmsg(MSG_DONTWAIT)
> > and doesn't look consistent to me.
> 
> It's not my favorite thing either but async event loops are usually
> modeled around EAGAIN so I think this has benefits. Josh can speak more
> to that.

I wouldn't expect the same application to use both WNOHANG and
O_NONBLOCK, since the latter makes the former unnecessary. I'd have no
objection if WNOHANG continued to have the same "return 0 and you have
to check the structure to figure out what that means" behavior
regardless of the fd flags, for compatibility with an application or
library that expects that behavior with WNOHANG and didn't expect the
return value to change with a non-blocking fd.  waitid could just return
EAGAIN on a non-blocking fd if *not* passed WNOHANG, which would make
pidfd Just Work in non-blocking event loops.
Josh Triplett Sept. 3, 2020, 11:56 p.m. UTC | #4
With or without the discussed change to WNOHANG behavior for
compatibility:
Reviewed-by: Josh Triplett <josh@joshtriplett.org>

Also, I think you should flip the order of patches 1 and 2, so that
there isn't a one-patch window in kernel history where you can create an
O_NONBLOCK pidfd with pidfd_open but it has no effect. I'd expect
userspace to use pidfd_open accepting or EINVAL-ing the flag as an
indication of whether it'll work.
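
The probing approach described above could look roughly like this from
userspace. It is a sketch under assumptions, not part of the series:
probe_nonblocking_pidfd() is a hypothetical helper, the non-blocking flag
accepted by pidfd_open is assumed to have the same value as O_NONBLOCK, and
the fallback syscall number 434 is assumed for headers that predate
pidfd_open.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434 /* assumption: number used by most architectures */
#endif

/*
 * Returns 1 if pidfd_open accepts the non-blocking flag,
 * 0 if the kernel rejects the flag with EINVAL,
 * -1 on any other error (for example ENOSYS on kernels without pidfd_open).
 */
static int probe_nonblocking_pidfd(void)
{
	/* Open a pidfd for the calling process itself; no child is needed. */
	int fd = (int)syscall(SYS_pidfd_open, getpid(), O_NONBLOCK);

	if (fd >= 0) {
		close(fd);
		return 1;
	}
	return errno == EINVAL ? 0 : -1;
}
```

A library could run this check once at startup and fall back to a blocking
pidfd plus WNOHANG when it returns 0.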
Christian Brauner Sept. 4, 2020, 10:29 a.m. UTC | #5
On Thu, Sep 03, 2020 at 04:56:59PM -0700, Josh Triplett wrote:
> With or without the discussed change to WNOHANG behavior for
> compatibility:
> Reviewed-by: Josh Triplett <josh@joshtriplett.org>

I think that WNOHANG compatibility change might be a good idea. So I've
changed this to:

	ret = do_wait(&wo);
	if (!ret && !(options & WNOHANG) && (f_flags & O_NONBLOCK))
		ret = -EAGAIN;
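
To make the difference to the v2 check easier to see, the two conditions can
be modelled as plain userspace functions. This is an illustrative sketch only:
waitid_result_v2() and waitid_result_revised() are hypothetical names that do
not exist in the kernel, and do_wait_ret stands in for the value returned by
do_wait() (0 when WNOHANG is set and no eligible task was found, a negative
errno on error, a positive value when a child was found).

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/wait.h>

/* v2 as posted: any "nothing ready" result on a non-blocking pidfd
 * becomes -EAGAIN, even if the caller passed WNOHANG explicitly. */
static long waitid_result_v2(long do_wait_ret, unsigned int f_flags)
{
	if (!do_wait_ret && (f_flags & O_NONBLOCK))
		return -EAGAIN;
	return do_wait_ret;
}

/* Revised check: callers that passed WNOHANG keep the old "return 0"
 * behavior; only the implicit non-blocking case turns into -EAGAIN. */
static long waitid_result_revised(long do_wait_ret, int options,
				  unsigned int f_flags)
{
	if (!do_wait_ret && !(options & WNOHANG) && (f_flags & O_NONBLOCK))
		return -EAGAIN;
	return do_wait_ret;
}
```

The only input for which the two models disagree is a non-blocking pidfd
combined with an explicit WNOHANG and no child ready: v2 yields -EAGAIN, the
revised check yields 0.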

> 
> Also, I think you should flip the order of patches 1 and 2, so that
> there isn't a one-patch window in kernel history where you can create an
> O_NONBLOCK pidfd with pidfd_open but it has no effect. I'd expect
> userspace to use pidfd_open accepting or EINVAL-ing the flag as an
> indication of whether it'll work.

Good point! I've changed the order now.

Thanks!
Christian
Patch

diff --git a/kernel/exit.c b/kernel/exit.c
index 733e80f334e7..254ea3efe954 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -1474,7 +1474,7 @@  static long do_wait(struct wait_opts *wo)
 	return retval;
 }
 
-static struct pid *pidfd_get_pid(unsigned int fd)
+static struct pid *pidfd_get_pid(unsigned int fd, unsigned int *flags)
 {
 	struct fd f;
 	struct pid *pid;
@@ -1484,8 +1484,10 @@  static struct pid *pidfd_get_pid(unsigned int fd)
 		return ERR_PTR(-EBADF);
 
 	pid = pidfd_pid(f.file);
-	if (!IS_ERR(pid))
+	if (!IS_ERR(pid)) {
 		get_pid(pid);
+		*flags = f.file->f_flags;
+	}
 
 	fdput(f);
 	return pid;
@@ -1498,6 +1500,7 @@  static long kernel_waitid(int which, pid_t upid, struct waitid_info *infop,
 	struct pid *pid = NULL;
 	enum pid_type type;
 	long ret;
+	unsigned int f_flags = 0;
 
 	if (options & ~(WNOHANG|WNOWAIT|WEXITED|WSTOPPED|WCONTINUED|
 			__WNOTHREAD|__WCLONE|__WALL))
@@ -1531,9 +1534,10 @@  static long kernel_waitid(int which, pid_t upid, struct waitid_info *infop,
 		if (upid < 0)
 			return -EINVAL;
 
-		pid = pidfd_get_pid(upid);
+		pid = pidfd_get_pid(upid, &f_flags);
 		if (IS_ERR(pid))
 			return PTR_ERR(pid);
+
 		break;
 	default:
 		return -EINVAL;
@@ -1544,7 +1548,12 @@  static long kernel_waitid(int which, pid_t upid, struct waitid_info *infop,
 	wo.wo_flags	= options;
 	wo.wo_info	= infop;
 	wo.wo_rusage	= ru;
+	if (f_flags & O_NONBLOCK)
+		wo.wo_flags |= WNOHANG;
+
 	ret = do_wait(&wo);
+	if (!ret && (f_flags & O_NONBLOCK))
+		ret = -EAGAIN;
 
 	put_pid(pid);
 	return ret;