Message ID | 20250207173639.884745-8-axboe@kernel.dk (mailing list archive) |
---|---|
State | New |
Series | io_uring epoll wait support |
On 2/7/25 17:32, Jens Axboe wrote:
> For existing epoll event loops that can't fully convert to io_uring,
> the used approach is usually to add the io_uring fd to the epoll
> instance and use epoll_wait() to wait on both "legacy" and io_uring
> events. While this work, it isn't optimal as:
>
> 1) epoll_wait() is pretty limited in what it can do. It does not support
>    partial reaping of events, or waiting on a batch of events.
>
> 2) When an io_uring ring is added to an epoll instance, it activates the
>    io_uring "I'm being polled" logic which slows things down.
>
> Rather than use this approach, with EPOLL_WAIT support added to io_uring,
> event loops can use the normal io_uring wait logic for everything, as
> long as an epoll wait request has been armed with io_uring.
>
> Note that IORING_OP_EPOLL_WAIT does NOT take a timeout value, as this
> is an async request. Waiting on io_uring events in general has various
> timeout parameters, and those are the ones that should be used when
> waiting on any kind of request. If events are immediately available for
> reaping, then This opcode will return those immediately. If none are
> available, then it will post an async completion when they become
> available.
>
> cqe->res will contain either an error code (< 0 value) for a malformed
> request, invalid epoll instance, etc. It will return a positive result
> indicating how many events were reaped.
>
> IORING_OP_EPOLL_WAIT requests may be canceled using the normal io_uring
> cancelation infrastructure. The poll logic for managing ownership is
> adopted to guard the epoll side too.
>
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
>  include/linux/io_uring_types.h |   4 +
>  include/uapi/linux/io_uring.h  |   1 +
>  io_uring/cancel.c              |   5 ++
>  io_uring/epoll.c               | 143 +++++++++++++++++++++++++++++++++
>  io_uring/epoll.h               |  22 +++++
>  io_uring/io_uring.c            |   5 ++
>  io_uring/opdef.c               |  14 ++++
>  7 files changed, 194 insertions(+)
>
> diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
> index e2fef264ff8b..031ba708a81d 100644
> --- a/include/linux/io_uring_types.h
> +++ b/include/linux/io_uring_types.h
> @@ -369,6 +369,10 @@ struct io_ring_ctx {
...
> +bool io_epoll_wait_remove_all(struct io_ring_ctx *ctx, struct io_uring_task *tctx,
> +			      bool cancel_all)
> +{
> +	return io_cancel_remove_all(ctx, tctx, &ctx->epoll_list, cancel_all, __io_epoll_wait_cancel);
> +}
> +
> +int io_epoll_wait_cancel(struct io_ring_ctx *ctx, struct io_cancel_data *cd,
> +			 unsigned int issue_flags)
> +{
> +	return io_cancel_remove(ctx, cd, issue_flags, &ctx->epoll_list, __io_epoll_wait_cancel);
> +}
> +
> +static void io_epoll_retry(struct io_kiocb *req, struct io_tw_state *ts)
> +{
> +	int v;
> +
> +	do {
> +		v = atomic_read(&req->poll_refs);
> +		if (unlikely(v != 1)) {
> +			if (WARN_ON_ONCE(!(v & IO_POLL_REF_MASK)))
> +				return;
> +			if (v & IO_POLL_CANCEL_FLAG) {
> +				__io_epoll_cancel(req);
> +				return;
> +			}
> +		}
> +		v &= IO_POLL_REF_MASK;
> +	} while (atomic_sub_return(v, &req->poll_refs) & IO_POLL_REF_MASK);

I actually looked up the epoll code this time. If we disregard
cancellations, you have only 1 wait entry, which should've been removed
from the queue by io_epoll_wait_fn(), in which case the entire loop is
doing nothing as there is no one to race with. ->hash_node is the only
shared part, but it's sync'ed by the mutex.

As for cancellation, epoll_wait_remove() also removes the entry, and you
can rely on it to tell if the entry was removed inside, from which you
derive if you're the current owner.
Maybe this handling might be useful for the multishot mode, perhaps
along the lines of:

io_epoll_retry()
{
	do {
		res = epoll_get_events();
		if (one_shot || cancel) {
			wq_remove();
			unhash();
			complete_req(res);
			return;
		}
		post_cqe(res);
		// now recheck if new events came while we were processing
		// the previous batch.
	} while (refs_drop(req->poll_refs));
}

epoll_issue(issue_flags)
{
	queue_poll();
	return;
}

But it might be better to just poll the epoll fd, reuse all the
io_uring polling machinery, and implement IO_URING_F_MULTISHOT for
the epoll opcode.

epoll_issue(issue_flags)
{
	if (!(flags & IO_URING_F_MULTISHOT))
		return -EAGAIN;

	res = epoll_check_events();
	post_cqe(res);
	etc.
}

I think that would make this patch quite trivial, including the
multishot mode.
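To make the ownership argument above concrete, here is a minimal sketch
(not part of the posted series) of a cancel path that leans on
epoll_wait_remove() alone. It assumes that helper returns non-zero when
it dequeued the wait entry itself; the exact return convention comes
from an earlier prep patch that is not quoted here.

/*
 * Sketch only: cancellation that derives ownership from whether
 * epoll_wait_remove() pulled the entry off the epoll waitqueue.
 */
static bool __io_epoll_wait_cancel(struct io_kiocb *req)
{
	struct io_epoll_wait *iew = io_kiocb_to_cmd(req, struct io_epoll_wait);

	/*
	 * Assumption: a non-zero return means the entry was still queued,
	 * so no wakeup is in flight and this path owns the request.
	 */
	if (epoll_wait_remove(req->file, &iew->wait)) {
		hlist_del_init(&req->hash_node);
		io_req_set_res(req, -ECANCELED, 0);
		req->io_task_work.func = io_req_task_complete;
		io_req_task_work_add(req);
	}
	/*
	 * Otherwise io_epoll_wait_fn() already claimed the request and will
	 * complete it; report the cancel attempt as handled, as the posted
	 * code does.
	 */
	return true;
}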
On 2/8/25 23:27, Pavel Begunkov wrote:
...
> But it might be better to just poll the epoll fd, reuse all the
> io_uring polling machinery, and implement IO_URING_F_MULTISHOT for
> the epoll opcode.
>
> epoll_issue(issue_flags)
> {
> 	if (!(flags & IO_URING_F_MULTISHOT))
> 		return -EAGAIN;
>
> 	res = epoll_check_events();
> 	post_cqe(res);
> 	etc.
> }
>
> I think that would make this patch quite trivial, including
> the multishot mode.

Something like this instead of the last patch. Completely untested, the
eventpoll.c hunk is dirty and might be incorrect, it still needs to pass
the right mask for polling, and all that. At least it looks simpler, and
probably doesn't need half of the prep patches.

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index b96cc9193517..99dd8c1a2f2c 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1996,33 +1996,6 @@ static int ep_try_send_events(struct eventpoll *ep,
 	return res;
 }
 
-static int ep_poll_queue(struct eventpoll *ep,
-			 struct epoll_event __user *events, int maxevents,
-			 struct wait_queue_entry *wait)
-{
-	int res = 0, eavail;
-
-	/* See ep_poll() for commentary */
-	eavail = ep_events_available(ep);
-	while (1) {
-		if (eavail) {
-			res = ep_try_send_events(ep, events, maxevents);
-			if (res)
-				return res;
-		}
-		if (!list_empty_careful(&wait->entry))
-			break;
-		write_lock_irq(&ep->lock);
-		eavail = ep_events_available(ep);
-		if (!eavail)
-			__add_wait_queue_exclusive(&ep->wq, wait);
-		write_unlock_irq(&ep->lock);
-		if (!eavail)
-			break;
-	}
-	return -EIOCBQUEUED;
-}
-
 static int __epoll_wait_remove(struct eventpoll *ep,
 			       struct wait_queue_entry *wait, int timed_out)
 {
@@ -2517,16 +2490,22 @@ static int ep_check_params(struct file *file, struct epoll_event __user *evs,
 	return 0;
 }
 
-int epoll_queue(struct file *file, struct epoll_event __user *events,
-		int maxevents, struct wait_queue_entry *wait)
+int epoll_sendevents(struct file *file, struct epoll_event __user *events,
+		     int maxevents)
 {
-	int ret;
+	int res = 0, eavail;
 
 	ret = ep_check_params(file, events, maxevents);
 	if (unlikely(ret))
 		return ret;
 
-	return ep_poll_queue(file->private_data, events, maxevents, wait);
+	eavail = ep_events_available(ep);
+	if (eavail) {
+		res = ep_try_send_events(ep, events, maxevents);
+		if (res)
+			return res;
+	}
+	return 0;
 }
 
 /*
diff --git a/include/linux/eventpoll.h b/include/linux/eventpoll.h
index 6c088d5e945b..751e3f325927 100644
--- a/include/linux/eventpoll.h
+++ b/include/linux/eventpoll.h
@@ -25,9 +25,8 @@ struct file *get_epoll_tfile_raw_ptr(struct file *file, int tfd, unsigned long t
 /* Used to release the epoll bits inside the "struct file" */
 void eventpoll_release_file(struct file *file);
 
-/* Use to reap events, and/or queue for a callback on new events */
-int epoll_queue(struct file *file, struct epoll_event __user *events,
-		int maxevents, struct wait_queue_entry *wait);
+int epoll_sendevents(struct file *file, struct epoll_event __user *events,
+		     int maxevents);
 
 /* Remove wait entry */
 int epoll_wait_remove(struct file *file, struct wait_queue_entry *wait);
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index e11c82638527..a559e1e1544a 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -278,6 +278,7 @@ enum io_uring_op {
 	IORING_OP_FTRUNCATE,
 	IORING_OP_BIND,
 	IORING_OP_LISTEN,
+	IORING_OP_EPOLL_WAIT,
 
 	/* this goes last, obviously */
 	IORING_OP_LAST,
diff --git a/io_uring/epoll.c b/io_uring/epoll.c
index 7848d9cc073d..6d2c48ba1923 100644
--- a/io_uring/epoll.c
+++ b/io_uring/epoll.c
@@ -20,6 +20,12 @@ struct io_epoll {
 	struct epoll_event		event;
 };
 
+struct io_epoll_wait {
+	struct file			*file;
+	int				maxevents;
+	struct epoll_event __user	*events;
+};
+
 int io_epoll_ctl_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 {
 	struct io_epoll *epoll = io_kiocb_to_cmd(req, struct io_epoll);
@@ -57,3 +63,30 @@ int io_epoll_ctl(struct io_kiocb *req, unsigned int issue_flags)
 	io_req_set_res(req, ret, 0);
 	return IOU_OK;
 }
+
+int io_epoll_wait_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
+{
+	struct io_epoll_wait *iew = io_kiocb_to_cmd(req, struct io_epoll_wait);
+
+	if (sqe->off || sqe->rw_flags || sqe->buf_index || sqe->splice_fd_in)
+		return -EINVAL;
+
+	iew->maxevents = READ_ONCE(sqe->len);
+	iew->events = u64_to_user_ptr(READ_ONCE(sqe->addr));
+	return 0;
+}
+
+int io_epoll_wait(struct io_kiocb *req, unsigned int issue_flags)
+{
+	struct io_epoll_wait *iew = io_kiocb_to_cmd(req, struct io_epoll_wait);
+	int ret;
+
+	ret = epoll_sendevents(req->file, iew->events, iew->maxevents);
+	if (ret == 0)
+		return -EAGAIN;
+	if (ret < 0)
+		req_set_fail(req);
+
+	io_req_set_res(req, ret, 0);
+	return IOU_OK;
+}
diff --git a/io_uring/epoll.h b/io_uring/epoll.h
index 870cce11ba98..4111997c360b 100644
--- a/io_uring/epoll.h
+++ b/io_uring/epoll.h
@@ -3,4 +3,6 @@
 #if defined(CONFIG_EPOLL)
 int io_epoll_ctl_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);
 int io_epoll_ctl(struct io_kiocb *req, unsigned int issue_flags);
+int io_epoll_wait_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);
+int io_epoll_wait(struct io_kiocb *req, unsigned int issue_flags);
 #endif
diff --git a/io_uring/opdef.c b/io_uring/opdef.c
index e8baef4e5146..bd62d6068b61 100644
--- a/io_uring/opdef.c
+++ b/io_uring/opdef.c
@@ -514,6 +514,18 @@ const struct io_issue_def io_issue_defs[] = {
 		.async_size		= sizeof(struct io_async_msghdr),
 #else
 		.prep			= io_eopnotsupp_prep,
+#endif
+	},
+	[IORING_OP_EPOLL_WAIT] = {
+		.needs_file		= 1,
+		.audit_skip		= 1,
+		.pollout		= 1,
+		.pollin			= 1,
+#if defined(CONFIG_EPOLL)
+		.prep			= io_epoll_wait_prep,
+		.issue			= io_epoll_wait,
+#else
+		.prep			= io_eopnotsupp_prep,
 #endif
 	},
 };
@@ -745,6 +757,9 @@ const struct io_cold_def io_cold_defs[] = {
 	[IORING_OP_LISTEN] = {
 		.name			= "LISTEN",
 	},
+	[IORING_OP_EPOLL_WAIT] = {
+		.name			= "EPOLL_WAIT",
+	},
 };
 
 const char *io_uring_get_opcode(u8 opcode)
On 2/8/25 5:24 PM, Pavel Begunkov wrote:
> On 2/8/25 23:27, Pavel Begunkov wrote:
> ...
>> But it might be better to just poll the epoll fd, reuse all the
>> io_uring polling machinery, and implement IO_URING_F_MULTISHOT for
>> the epoll opcode.
>>
>> epoll_issue(issue_flags)
>> {
>> 	if (!(flags & IO_URING_F_MULTISHOT))
>> 		return -EAGAIN;
>>
>> 	res = epoll_check_events();
>> 	post_cqe(res);
>> 	etc.
>> }
>>
>> I think that would make this patch quite trivial, including
>> the multishot mode.
>
> Something like this instead of the last patch. Completely untested, the
> eventpoll.c hunk is dirty and might be incorrect, it still needs to pass
> the right mask for polling, and all that. At least it looks simpler, and
> probably doesn't need half of the prep patches.

I like that idea! I'll roll with it and get it finalized and then do
some testing.
diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index e2fef264ff8b..031ba708a81d 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -369,6 +369,10 @@ struct io_ring_ctx {
 	struct io_alloc_cache	futex_cache;
 #endif
 
+#ifdef CONFIG_EPOLL
+	struct hlist_head	epoll_list;
+#endif
+
 	const struct cred	*sq_creds;	/* cred used for __io_sq_thread() */
 	struct io_sq_data	*sq_data;	/* if using sq thread polling */
 
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index e11c82638527..a559e1e1544a 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -278,6 +278,7 @@ enum io_uring_op {
 	IORING_OP_FTRUNCATE,
 	IORING_OP_BIND,
 	IORING_OP_LISTEN,
+	IORING_OP_EPOLL_WAIT,
 
 	/* this goes last, obviously */
 	IORING_OP_LAST,
diff --git a/io_uring/cancel.c b/io_uring/cancel.c
index 0870060bac7c..d1af9496d9b3 100644
--- a/io_uring/cancel.c
+++ b/io_uring/cancel.c
@@ -17,6 +17,7 @@
 #include "timeout.h"
 #include "waitid.h"
 #include "futex.h"
+#include "epoll.h"
 #include "cancel.h"
 
 struct io_cancel {
@@ -128,6 +129,10 @@ int io_try_cancel(struct io_uring_task *tctx, struct io_cancel_data *cd,
 	if (ret != -ENOENT)
 		return ret;
 
+	ret = io_epoll_wait_cancel(ctx, cd, issue_flags);
+	if (ret != -ENOENT)
+		return ret;
+
 	spin_lock(&ctx->completion_lock);
 	if (!(cd->flags & IORING_ASYNC_CANCEL_FD))
 		ret = io_timeout_cancel(ctx, cd);
diff --git a/io_uring/epoll.c b/io_uring/epoll.c
index 7848d9cc073d..8f54bb1c39de 100644
--- a/io_uring/epoll.c
+++ b/io_uring/epoll.c
@@ -11,6 +11,7 @@
 
 #include "io_uring.h"
 #include "epoll.h"
+#include "poll.h"
 
 struct io_epoll {
 	struct file			*file;
@@ -20,6 +21,13 @@ struct io_epoll {
 	struct epoll_event		event;
 };
 
+struct io_epoll_wait {
+	struct file			*file;
+	int				maxevents;
+	struct epoll_event __user	*events;
+	struct wait_queue_entry		wait;
+};
+
 int io_epoll_ctl_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 {
 	struct io_epoll *epoll = io_kiocb_to_cmd(req, struct io_epoll);
@@ -57,3 +65,138 @@ int io_epoll_ctl(struct io_kiocb *req, unsigned int issue_flags)
 	io_req_set_res(req, ret, 0);
 	return IOU_OK;
 }
+
+static void __io_epoll_finish(struct io_kiocb *req, int res)
+{
+	struct io_epoll_wait *iew = io_kiocb_to_cmd(req, struct io_epoll_wait);
+
+	lockdep_assert_held(&req->ctx->uring_lock);
+
+	epoll_wait_remove(req->file, &iew->wait);
+	hlist_del_init(&req->hash_node);
+	io_req_set_res(req, res, 0);
+	req->io_task_work.func = io_req_task_complete;
+	io_req_task_work_add(req);
+}
+
+static void __io_epoll_cancel(struct io_kiocb *req)
+{
+	__io_epoll_finish(req, -ECANCELED);
+}
+
+static bool __io_epoll_wait_cancel(struct io_kiocb *req)
+{
+	io_poll_mark_cancelled(req);
+	if (io_poll_get_ownership(req))
+		__io_epoll_cancel(req);
+	return true;
+}
+
+bool io_epoll_wait_remove_all(struct io_ring_ctx *ctx, struct io_uring_task *tctx,
+			      bool cancel_all)
+{
+	return io_cancel_remove_all(ctx, tctx, &ctx->epoll_list, cancel_all, __io_epoll_wait_cancel);
+}
+
+int io_epoll_wait_cancel(struct io_ring_ctx *ctx, struct io_cancel_data *cd,
+			 unsigned int issue_flags)
+{
+	return io_cancel_remove(ctx, cd, issue_flags, &ctx->epoll_list, __io_epoll_wait_cancel);
+}
+
+static void io_epoll_retry(struct io_kiocb *req, struct io_tw_state *ts)
+{
+	int v;
+
+	do {
+		v = atomic_read(&req->poll_refs);
+		if (unlikely(v != 1)) {
+			if (WARN_ON_ONCE(!(v & IO_POLL_REF_MASK)))
+				return;
+			if (v & IO_POLL_CANCEL_FLAG) {
+				__io_epoll_cancel(req);
+				return;
+			}
+		}
+		v &= IO_POLL_REF_MASK;
+	} while (atomic_sub_return(v, &req->poll_refs) & IO_POLL_REF_MASK);
+
+	io_req_task_submit(req, ts);
+}
+
+static int io_epoll_execute(struct io_kiocb *req)
+{
+	struct io_epoll_wait *iew = io_kiocb_to_cmd(req, struct io_epoll_wait);
+
+	list_del_init_careful(&iew->wait.entry);
+	if (io_poll_get_ownership(req)) {
+		req->io_task_work.func = io_epoll_retry;
+		io_req_task_work_add(req);
+	}
+
+	return 1;
+}
+
+static __cold int io_epoll_pollfree_wake(struct io_kiocb *req)
+{
+	struct io_epoll_wait *iew = io_kiocb_to_cmd(req, struct io_epoll_wait);
+
+	io_poll_mark_cancelled(req);
+	list_del_init_careful(&iew->wait.entry);
+	io_epoll_execute(req);
+	return 1;
+}
+
+static int io_epoll_wait_fn(struct wait_queue_entry *wait, unsigned mode,
+			    int sync, void *key)
+{
+	struct io_kiocb *req = wait->private;
+	__poll_t mask = key_to_poll(key);
+
+	if (unlikely(mask & POLLFREE))
+		return io_epoll_pollfree_wake(req);
+
+	return io_epoll_execute(req);
+}
+
+int io_epoll_wait_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
+{
+	struct io_epoll_wait *iew = io_kiocb_to_cmd(req, struct io_epoll_wait);
+
+	if (sqe->off || sqe->rw_flags || sqe->buf_index || sqe->splice_fd_in)
+		return -EINVAL;
+
+	iew->maxevents = READ_ONCE(sqe->len);
+	iew->events = u64_to_user_ptr(READ_ONCE(sqe->addr));
+
+	iew->wait.flags = 0;
+	iew->wait.private = req;
+	iew->wait.func = io_epoll_wait_fn;
+	INIT_LIST_HEAD(&iew->wait.entry);
+	INIT_HLIST_NODE(&req->hash_node);
+	atomic_set(&req->poll_refs, 0);
+	return 0;
+}
+
+int io_epoll_wait(struct io_kiocb *req, unsigned int issue_flags)
+{
+	struct io_epoll_wait *iew = io_kiocb_to_cmd(req, struct io_epoll_wait);
+	struct io_ring_ctx *ctx = req->ctx;
+	int ret;
+
+	io_ring_submit_lock(ctx, issue_flags);
+
+	ret = epoll_queue(req->file, iew->events, iew->maxevents, &iew->wait);
+	if (ret == -EIOCBQUEUED) {
+		if (hlist_unhashed(&req->hash_node))
+			hlist_add_head(&req->hash_node, &ctx->epoll_list);
+		io_ring_submit_unlock(ctx, issue_flags);
+		return IOU_ISSUE_SKIP_COMPLETE;
+	} else if (ret < 0) {
+		req_set_fail(req);
+	}
+	hlist_del_init(&req->hash_node);
+	io_ring_submit_unlock(ctx, issue_flags);
+	io_req_set_res(req, ret, 0);
+	return IOU_OK;
+}
diff --git a/io_uring/epoll.h b/io_uring/epoll.h
index 870cce11ba98..296940d89063 100644
--- a/io_uring/epoll.h
+++ b/io_uring/epoll.h
@@ -1,6 +1,28 @@
 // SPDX-License-Identifier: GPL-2.0
 
+#include "cancel.h"
+
 #if defined(CONFIG_EPOLL)
+int io_epoll_wait_cancel(struct io_ring_ctx *ctx, struct io_cancel_data *cd,
+			 unsigned int issue_flags);
+bool io_epoll_wait_remove_all(struct io_ring_ctx *ctx, struct io_uring_task *tctx,
+			      bool cancel_all);
+
 int io_epoll_ctl_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);
 int io_epoll_ctl(struct io_kiocb *req, unsigned int issue_flags);
+int io_epoll_wait_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);
+int io_epoll_wait(struct io_kiocb *req, unsigned int issue_flags);
+#else
+static inline bool io_epoll_wait_remove_all(struct io_ring_ctx *ctx,
+					    struct io_uring_task *tctx,
+					    bool cancel_all)
+{
+	return false;
+}
+static inline int io_epoll_wait_cancel(struct io_ring_ctx *ctx,
+				       struct io_cancel_data *cd,
+				       unsigned int issue_flags)
+{
+	return 0;
+}
 #endif
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index ec98a0ec6f34..73b9246eaa50 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -93,6 +93,7 @@
 #include "notif.h"
 #include "waitid.h"
 #include "futex.h"
+#include "epoll.h"
 #include "napi.h"
 #include "uring_cmd.h"
 #include "msg_ring.h"
@@ -356,6 +357,9 @@ static __cold struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
 	INIT_HLIST_HEAD(&ctx->waitid_list);
 #ifdef CONFIG_FUTEX
 	INIT_HLIST_HEAD(&ctx->futex_list);
+#endif
+#ifdef CONFIG_EPOLL
+	INIT_HLIST_HEAD(&ctx->epoll_list);
 #endif
 	INIT_DELAYED_WORK(&ctx->fallback_work, io_fallback_req_func);
 	INIT_WQ_LIST(&ctx->submit_state.compl_reqs);
@@ -3079,6 +3083,7 @@ static __cold bool io_uring_try_cancel_requests(struct io_ring_ctx *ctx,
 	ret |= io_poll_remove_all(ctx, tctx, cancel_all);
 	ret |= io_waitid_remove_all(ctx, tctx, cancel_all);
 	ret |= io_futex_remove_all(ctx, tctx, cancel_all);
+	ret |= io_epoll_wait_remove_all(ctx, tctx, cancel_all);
 	ret |= io_uring_try_cancel_uring_cmd(ctx, tctx, cancel_all);
 	mutex_unlock(&ctx->uring_lock);
 	ret |= io_kill_timeouts(ctx, tctx, cancel_all);
diff --git a/io_uring/opdef.c b/io_uring/opdef.c
index e8baef4e5146..44553a657476 100644
--- a/io_uring/opdef.c
+++ b/io_uring/opdef.c
@@ -514,6 +514,17 @@ const struct io_issue_def io_issue_defs[] = {
 		.async_size		= sizeof(struct io_async_msghdr),
 #else
 		.prep			= io_eopnotsupp_prep,
+#endif
+	},
+	[IORING_OP_EPOLL_WAIT] = {
+		.needs_file		= 1,
+		.unbound_nonreg_file	= 1,
+		.audit_skip		= 1,
+#if defined(CONFIG_EPOLL)
+		.prep			= io_epoll_wait_prep,
+		.issue			= io_epoll_wait,
+#else
+		.prep			= io_eopnotsupp_prep,
 #endif
 	},
 };
@@ -745,6 +756,9 @@ const struct io_cold_def io_cold_defs[] = {
 	[IORING_OP_LISTEN] = {
 		.name			= "LISTEN",
 	},
+	[IORING_OP_EPOLL_WAIT] = {
+		.name			= "EPOLL_WAIT",
+	},
 };
 
 const char *io_uring_get_opcode(u8 opcode)
For existing epoll event loops that can't fully convert to io_uring,
the usual approach is to add the io_uring fd to the epoll instance and
use epoll_wait() to wait on both "legacy" and io_uring events. While
this works, it isn't optimal as:

1) epoll_wait() is pretty limited in what it can do. It does not support
   partial reaping of events, or waiting on a batch of events.

2) When an io_uring ring is added to an epoll instance, it activates the
   io_uring "I'm being polled" logic, which slows things down.

Rather than use this approach, with EPOLL_WAIT support added to io_uring,
event loops can use the normal io_uring wait logic for everything, as
long as an epoll wait request has been armed with io_uring.

Note that IORING_OP_EPOLL_WAIT does NOT take a timeout value, as this
is an async request. Waiting on io_uring events in general has various
timeout parameters, and those are the ones that should be used when
waiting on any kind of request. If events are immediately available for
reaping, then this opcode will return those immediately. If none are
available, then it will post an async completion when they become
available.

cqe->res will contain either an error code (< 0 value) for a malformed
request, an invalid epoll instance, etc., or a positive result
indicating how many events were reaped.

IORING_OP_EPOLL_WAIT requests may be canceled using the normal io_uring
cancelation infrastructure. The poll logic for managing ownership is
adapted to guard the epoll side too.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 include/linux/io_uring_types.h |   4 +
 include/uapi/linux/io_uring.h  |   1 +
 io_uring/cancel.c              |   5 ++
 io_uring/epoll.c               | 143 +++++++++++++++++++++++++++++++++
 io_uring/epoll.h               |  22 +++++
 io_uring/io_uring.c            |   5 ++
 io_uring/opdef.c               |  14 ++++
 7 files changed, 194 insertions(+)
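For completeness, a hedged userspace sketch of driving the new opcode:
it fills the raw SQE fields exactly as io_epoll_wait_prep() reads them
(fd = the epoll instance, addr = a buffer of struct epoll_event, len =
maxevents). It assumes liburing headers that already define
IORING_OP_EPOLL_WAIT; a dedicated liburing prep helper may or may not
exist for the kernel you test against.

#include <liburing.h>
#include <sys/epoll.h>
#include <stdint.h>
#include <string.h>

/* Submit one IORING_OP_EPOLL_WAIT and return cqe->res:
 * < 0 is an error, > 0 is the number of events written to 'events'. */
static int epoll_wait_via_uring(struct io_uring *ring, int epfd,
				struct epoll_event *events, int maxevents)
{
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
	struct io_uring_cqe *cqe;
	int ret;

	memset(sqe, 0, sizeof(*sqe));
	sqe->opcode = IORING_OP_EPOLL_WAIT;	/* opcode added by this series */
	sqe->fd = epfd;				/* the epoll instance */
	sqe->addr = (unsigned long long)(uintptr_t)events;
	sqe->len = maxevents;

	io_uring_submit(ring);

	/* No per-request timeout: bound the wait with the ring's own
	 * wait interfaces (e.g. io_uring_wait_cqe_timeout()) instead. */
	ret = io_uring_wait_cqe(ring, &cqe);
	if (ret < 0)
		return ret;
	ret = cqe->res;
	io_uring_cqe_seen(ring, cqe);
	return ret;
}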