
io_uring/poll: use IOU_F_TWQ_LAZY_WAKE for wakeups

Message ID e6bcdcb4-8a3d-48a1-8301-623cd30430e6@kernel.dk (mailing list archive)
State New

Commit Message

Jens Axboe Oct. 19, 2023, 1:01 p.m. UTC
With poll triggered retries, each event trigger will cause a task_work
item to be added for processing. If the ring is set up with
IORING_SETUP_DEFER_TASKRUN and a task is waiting on multiple events to
complete, any task_work addition will wake the task for processing these
items. This can cause more context switches than we would like, if the
application is deliberately waiting on multiple items to increase
efficiency.

For example, if an application has receive multishot armed for sockets
and wants to wait for N to complete within M usec of time, we should not
be waking up and processing these items until we have all the events we
asked for. By switching the poll trigger to lazy wake, we'll process
them when they are all ready, in one swoop, rather than wake multiple
times only to process one and then go back to sleep.

At some point we probably want to look at just making the lazy wake
the default, but for now, let's just selectively enable it where it
makes sense.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

---
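For illustration, a minimal userspace sketch (liburing) of the kind of setup
the commit message describes: a DEFER_TASKRUN ring with a multishot receive
armed, batch-waiting for several completions under a timeout. The fd, buffer
group, wait count and timeout values are placeholders, not values taken from
this patch.

#include <liburing.h>

static int batch_wait_example(int sockfd)
{
	struct io_uring ring;
	struct io_uring_params p = {
		/* DEFER_TASKRUN requires SINGLE_ISSUER */
		.flags = IORING_SETUP_SINGLE_ISSUER | IORING_SETUP_DEFER_TASKRUN,
	};
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	struct __kernel_timespec ts = { .tv_nsec = 50 * 1000 };	/* the "M usec" */
	int ret;

	ret = io_uring_queue_init_params(64, &ring, &p);
	if (ret)
		return ret;

	/*
	 * Arm a multishot receive; assumes a provided buffer ring has been
	 * registered for buffer group 0 elsewhere.
	 */
	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_recv_multishot(sqe, sockfd, NULL, 0, 0);
	sqe->flags |= IOSQE_BUFFER_SELECT;
	sqe->buf_group = 0;
	io_uring_submit(&ring);

	/*
	 * With lazy wake, the task should ideally not be woken until N (here 8)
	 * completions are pending or the timeout fires, instead of once per
	 * arriving packet.
	 */
	ret = io_uring_wait_cqes(&ring, &cqe, 8, &ts, NULL);
	io_uring_queue_exit(&ring);
	return ret;
}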

Comments

Pavel Begunkov Nov. 7, 2023, 5:47 p.m. UTC | #1
On 10/19/23 14:01, Jens Axboe wrote:
> With poll triggered retries, each event trigger will cause a task_work
> item to be added for processing. If the ring is setup with
> IORING_SETUP_DEFER_TASKRUN and a task is waiting on multiple events to
> complete, any task_work addition will wake the task for processing these
> items. This can cause more context switches than we would like, if the
> application is deliberately waiting on multiple items to increase
> efficiency.

I'm a bit late here. The reason why I didn't enable it for polling is
because it changes the behaviour. Let's think of a situation where we
want to accept 2 sockets, so we send a multishot accept and do
cq_wait(nr=2). It was perfectly fine before, but now it'll hang, as
there's only 1 request and so only 1 tw queued. The same would happen with
multishot recv, even though that's more relevant to packet-based protocols
like UDP.
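
A minimal liburing sketch of that pattern (illustrative only, not code from
the thread):

#include <liburing.h>

/* assumes a DEFER_TASKRUN ring and a listening socket set up elsewhere */
static int accept_two(struct io_uring *ring, int listen_fd)
{
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
	struct io_uring_cqe *cqe;

	/* one multishot accept, then a batched wait for two connections */
	io_uring_prep_multishot_accept(sqe, listen_fd, NULL, NULL, 0);
	io_uring_submit(ring);

	/*
	 * Two clients connect, so two CQEs are eventually due, but only one
	 * request (and thus one lazily queued tw item) exists, so the wake
	 * threshold of 2 is never reached and this wait can stall.
	 */
	return io_uring_wait_cqe_nr(ring, &cqe, 2);
}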

It might not be specific to multishots:
listen(backlog=1), queue N oneshot accepts and cq_wait(N).

Now we get the first connection in the queue to accept.

	[IORING_OP_ACCEPT] = {
		.poll_exclusive		= 1,
	}

Due to poll_exclusive (I assume) it wakes only one accept. That
will try to queue up a tw for it, but the tw won't be executed
because it's just one item. No other connection can be queued
up because of the backlog limit => presumably no other request
will be woken up => that first tw never runs. It's more subtle
and timing-specific than the previous example, but nevertheless
it's concerning that we might step on something like that.


> For example, if an application has receive multishot armed for sockets
> and wants to wait for N to complete within M usec of time, we should not
> be waking up and processing these items until we have all the events we
> asked for. By switching the poll trigger to lazy wake, we'll process
> them when they are all ready, in one swoop, rather than wake multiple
> times only to process one and then go back to sleep.
> 
> At some point we probably want to look at just making the lazy wake
> the default, but for now, let's just selectively enable it where it
> makes sense.
> 
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> 
> ---
> 
> diff --git a/io_uring/poll.c b/io_uring/poll.c
> index 4c360ba8793a..d38d05edb4fa 100644
> --- a/io_uring/poll.c
> +++ b/io_uring/poll.c
> @@ -370,7 +370,7 @@ static void __io_poll_execute(struct io_kiocb *req, int mask)
>   	req->io_task_work.func = io_poll_task_func;
>   
>   	trace_io_uring_task_add(req, mask);
> -	io_req_task_work_add(req);
> +	__io_req_task_work_add(req, IOU_F_TWQ_LAZY_WAKE);
>   }
>   
>   static inline void io_poll_execute(struct io_kiocb *req, int res)
>
Jens Axboe Nov. 7, 2023, 7:57 p.m. UTC | #2
On 11/7/23 10:47 AM, Pavel Begunkov wrote:
> On 10/19/23 14:01, Jens Axboe wrote:
>> With poll triggered retries, each event trigger will cause a task_work
>> item to be added for processing. If the ring is set up with
>> IORING_SETUP_DEFER_TASKRUN and a task is waiting on multiple events to
>> complete, any task_work addition will wake the task for processing these
>> items. This can cause more context switches than we would like, if the
>> application is deliberately waiting on multiple items to increase
>> efficiency.
> 
> I'm a bit late here. The reason why I didn't enable it for polling is
> because it changes the behaviour. Let's think of a situation where we
> want to accept 2 sockets, so we send a multishot accept and do
> cq_wait(nr=2). It was perfectly fine before, but now it'll hang, as
> there's only 1 request and so only 1 tw queued. The same would happen with
> multishot recv, even though that's more relevant to packet-based protocols
> like UDP.
> 
> It might not be specific to multishots:
> listen(backlog=1), queue N oneshot accepts and cq_wait(N).

I don't think it's unreasonable to assume that you need a backlog of N
if you batch wait for N to come in, with defer taskrun. I'm more curious
about non-accept cases that would potentially break.
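
For what that assumption could look like in code (an illustrative fragment,
listen_fd and nr_expected are made-up names):

#include <sys/socket.h>

static int setup_listener(int listen_fd, int nr_expected)
{
	/*
	 * If the application intends to batch-wait for nr_expected accept
	 * completions, give the accept backlog at least that much room.
	 */
	return listen(listen_fd, nr_expected);
}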

> Now we get the first connection in the queue to accept.
> 
>     [IORING_OP_ACCEPT] = {
>         .poll_exclusive        = 1,
>     }
> 
> Due to poll_exclusive (I assume) it wakes only one accept. That

Right, if poll_exclusive is set, then we use exclusive waits. IFF the
caller also groks that and asks to wake just 1, then we will indeed just
wake one. I can certainly see this one being a problem, but at least
that could be handled by not doing this for exclusive poll waits.
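
For reference, a rough sketch of the waitqueue mechanics in play (a simplified
model using the real wait API, not the actual socket code): a poll-style
wakeup passes nr_exclusive=1, so the walk over the waitqueue stops after the
first exclusive entry that handles the event.

#include <linux/wait.h>
#include <linux/poll.h>

/* io_uring adds the poll entry exclusively when .poll_exclusive is set */
static void arm_exclusive(struct wait_queue_head *head,
			  struct wait_queue_entry *wait)
{
	add_wait_queue_exclusive(head, wait);
}

/* a readiness event, e.g. a new connection hitting the accept queue */
static void readiness_event(struct wait_queue_head *head, __poll_t mask)
{
	/*
	 * All non-exclusive entries are called, but at most one exclusive
	 * entry is. If that one waiter only queues deferred work and nothing
	 * re-fires the waitqueue, the remaining exclusive waiters stay asleep.
	 */
	wake_up_poll(head, mask);
}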

> will try to queue up a tw for it, but the tw won't be executed
> because it's just one item. No other connection can be queued
> up because of the backlog limit => presumably no other request
> will be woken up => that first tw never runs. It's more subtle
> and timing-specific than the previous example, but nevertheless
> it's concerning that we might step on something like that.

Backlog aside, as per above, what other cases would we need to worry
about here? It's really anything where something in poll would need
processing to trigger more events.
 
IOW, if we can at least allow some known cases, that'd certainly
help, or conversely exclude the problematic ones (like poll_exclusive).
Jens Axboe Nov. 8, 2023, 4:14 p.m. UTC | #3
In the spirit of getting some closure/progress on this, how about this
for starters? It disables lazy wake for poll_exclusive, and adds a request
flag that other spots can set to opt out of lazy wake as well.
 
diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index d3009d56af0b..03401c6ce5bb 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -431,6 +431,7 @@ enum {
 	/* keep async read/write and isreg together and in order */
 	REQ_F_SUPPORT_NOWAIT_BIT,
 	REQ_F_ISREG_BIT,
+	REQ_F_POLL_NO_LAZY_BIT,
 
 	/* not a real bit, just to check we're not overflowing the space */
 	__REQ_F_LAST_BIT,
@@ -498,6 +499,8 @@ enum {
 	REQ_F_CLEAR_POLLIN	= BIT(REQ_F_CLEAR_POLLIN_BIT),
 	/* hashed into ->cancel_hash_locked, protected by ->uring_lock */
 	REQ_F_HASH_LOCKED	= BIT(REQ_F_HASH_LOCKED_BIT),
+	/* don't use lazy poll wake for this request */
+	REQ_F_POLL_NO_LAZY	= BIT(REQ_F_POLL_NO_LAZY_BIT),
 };
 
 typedef void (*io_req_tw_func_t)(struct io_kiocb *req, struct io_tw_state *ts);
diff --git a/io_uring/poll.c b/io_uring/poll.c
index d38d05edb4fa..4fed5514c379 100644
--- a/io_uring/poll.c
+++ b/io_uring/poll.c
@@ -366,11 +366,16 @@ void io_poll_task_func(struct io_kiocb *req, struct io_tw_state *ts)
 
 static void __io_poll_execute(struct io_kiocb *req, int mask)
 {
+	unsigned flags = 0;
+
 	io_req_set_res(req, mask, 0);
 	req->io_task_work.func = io_poll_task_func;
 
 	trace_io_uring_task_add(req, mask);
-	__io_req_task_work_add(req, IOU_F_TWQ_LAZY_WAKE);
+
+	if (!(req->flags & REQ_F_POLL_NO_LAZY))
+		flags = IOU_F_TWQ_LAZY_WAKE;
+	__io_req_task_work_add(req, flags);
 }
 
 static inline void io_poll_execute(struct io_kiocb *req, int res)
@@ -526,10 +531,17 @@ static void __io_queue_proc(struct io_poll *poll, struct io_poll_table *pt,
 	poll->head = head;
 	poll->wait.private = (void *) wqe_private;
 
-	if (poll->events & EPOLLEXCLUSIVE)
+	if (poll->events & EPOLLEXCLUSIVE) {
+		/*
+		 * Exclusive waits may only wake a limited amount of entries
+		 * rather than all of them, this may interfere with lazy
+		 * wake if someone does wait(events > 1).
+		 */
+		req->flags |= REQ_F_POLL_NO_LAZY;
 		add_wait_queue_exclusive(head, &poll->wait);
-	else
+	} else {
 		add_wait_queue(head, &poll->wait);
+	}
 }
 
 static void io_poll_queue_proc(struct file *file, struct wait_queue_head *head,
Pavel Begunkov Nov. 9, 2023, 1:07 p.m. UTC | #4
On 11/8/23 16:14, Jens Axboe wrote:
> In the spirit of getting some closure/progress on this, how about this
> for starters? Disables lazy wake for poll_exclusive, and provides a flag
> that can otherwise be set to disable it as well.

I'm not that concerned about the non-multishot accept case specifically;
if anything, we can let it slide by saying that backlog=1 is unreasonable
there.

A bigger problem is that, for the purpose of counting the nr_wait passed
into the waiting syscall, users must never assume that a multishot request
will produce more than 1 completion.

For example this is not allowed:

queue_multishot_{rcv,accept}();
cq_wait(2);
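
An illustrative liburing sketch of staying within that rule (not code from
the thread): count on at most one completion per armed request, e.g. by
reaping a multishot request's CQEs one at a time.

#include <liburing.h>

/* assumes the ring and a listening socket are set up elsewhere */
static int accept_n(struct io_uring *ring, int listen_fd, int n)
{
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
	struct io_uring_cqe *cqe;
	int accepted = 0;

	io_uring_prep_multishot_accept(sqe, listen_fd, NULL, NULL, 0);
	io_uring_submit(ring);

	while (accepted < n) {
		/* wait_nr == 1: never relies on multiple CQEs per request */
		if (io_uring_wait_cqe(ring, &cqe))
			break;
		if (cqe->res >= 0)
			accepted++;
		io_uring_cqe_seen(ring, cqe);
	}
	return accepted;
}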

So we can just leave the enablement patch alone and say it's the only
reasonable behaviour, and it was the intended way from the beginning
(hoping nobody will complain about it). Or do it via a flag, perhaps
SETUP_*.

For the multishot part I described the rules above. As for the problem
in general, it comes from interdependencies between requests, so the rule is
the vague "there should be no deps between requests", but I'm not sure we
can spell out the precise semantics at the moment.


> diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
> index d3009d56af0b..03401c6ce5bb 100644
> --- a/include/linux/io_uring_types.h
> +++ b/include/linux/io_uring_types.h
> @@ -431,6 +431,7 @@ enum {
>   	/* keep async read/write and isreg together and in order */
>   	REQ_F_SUPPORT_NOWAIT_BIT,
>   	REQ_F_ISREG_BIT,
> +	REQ_F_POLL_NO_LAZY_BIT,
>   
>   	/* not a real bit, just to check we're not overflowing the space */
>   	__REQ_F_LAST_BIT,
> @@ -498,6 +499,8 @@ enum {
>   	REQ_F_CLEAR_POLLIN	= BIT(REQ_F_CLEAR_POLLIN_BIT),
>   	/* hashed into ->cancel_hash_locked, protected by ->uring_lock */
>   	REQ_F_HASH_LOCKED	= BIT(REQ_F_HASH_LOCKED_BIT),
> +	/* don't use lazy poll wake for this request */
> +	REQ_F_POLL_NO_LAZY	= BIT(REQ_F_POLL_NO_LAZY_BIT),
>   };
>   
>   typedef void (*io_req_tw_func_t)(struct io_kiocb *req, struct io_tw_state *ts);
> diff --git a/io_uring/poll.c b/io_uring/poll.c
> index d38d05edb4fa..4fed5514c379 100644
> --- a/io_uring/poll.c
> +++ b/io_uring/poll.c
> @@ -366,11 +366,16 @@ void io_poll_task_func(struct io_kiocb *req, struct io_tw_state *ts)
>   
>   static void __io_poll_execute(struct io_kiocb *req, int mask)
>   {
> +	unsigned flags = 0;
> +

Why a flag when you can just check the exclusive flag
in the poll entry?

>   	io_req_set_res(req, mask, 0);
>   	req->io_task_work.func = io_poll_task_func;
>   
>   	trace_io_uring_task_add(req, mask);
> -	__io_req_task_work_add(req, IOU_F_TWQ_LAZY_WAKE);
> +
> +	if (!(req->flags & REQ_F_POLL_NO_LAZY))
> +		flags = IOU_F_TWQ_LAZY_WAKE;
> +	__io_req_task_work_add(req, flags);
>   }
>
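
A sketch of what that alternative might look like (assuming the
EPOLLEXCLUSIVE bit can simply be read back via io_poll_get_single();
double-poll entries may still need separate handling):

static void __io_poll_execute(struct io_kiocb *req, int mask)
{
	unsigned flags = 0;

	io_req_set_res(req, mask, 0);
	req->io_task_work.func = io_poll_task_func;

	trace_io_uring_task_add(req, mask);

	/* read exclusivity straight off the poll entry, no request flag */
	if (!(io_poll_get_single(req)->events & EPOLLEXCLUSIVE))
		flags = IOU_F_TWQ_LAZY_WAKE;
	__io_req_task_work_add(req, flags);
}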

Patch

diff --git a/io_uring/poll.c b/io_uring/poll.c
index 4c360ba8793a..d38d05edb4fa 100644
--- a/io_uring/poll.c
+++ b/io_uring/poll.c
@@ -370,7 +370,7 @@ static void __io_poll_execute(struct io_kiocb *req, int mask)
 	req->io_task_work.func = io_poll_task_func;
 
 	trace_io_uring_task_add(req, mask);
-	io_req_task_work_add(req);
+	__io_req_task_work_add(req, IOU_F_TWQ_LAZY_WAKE);
 }
 
 static inline void io_poll_execute(struct io_kiocb *req, int res)