Message ID: 81104db1a04efbfcec90f5819081b4299542671a.1671559005.git.asml.silence@gmail.com
State: New
Series: [RFC] io_uring: wake up optimisations
On 12/20/22 17:58, Pavel Begunkov wrote:
> NOT FOR INCLUSION, needs some ring poll workarounds
>
> Flushing completions is done either from the submit syscall or by
> task_work; both run in the context of the submitter task, and for
> single-threaded rings, as implied by ->task_complete, there won't be
> any waiters on ->cq_wait but the master task. That means that there
> can be no tasks sleeping on cq_wait while we run
> __io_submit_flush_completions(), and so waking up can be skipped.

Not trivial to benchmark, as we need something to emulate a task_work
arriving in the middle of waiting. I used the diff below to complete
nops in tw and removed the preliminary tw runs for the "in the middle
of waiting" part. IORING_SETUP_SKIP_CQWAKE controls whether we use the
optimisation or not.

It gets around 15% more IOPS (6769526 -> 7803304), which correlates
with the 10% of wakeup cost seen in profiles. Another interesting part
is that waitqueues are excessive for our purposes and we can replace
cq_wait with something lighter, e.g. an atomic bit.

diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 9d4c4078e8d0..5a4f03a4ea40 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -164,6 +164,7 @@ enum {
  * try to do it just before it is needed.
  */
 #define IORING_SETUP_DEFER_TASKRUN	(1U << 13)
+#define IORING_SETUP_SKIP_CQWAKE	(1U << 14)
 
 enum io_uring_op {
 	IORING_OP_NOP,
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index a57b9008807c..68556dea060b 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -631,7 +631,7 @@ static inline void __io_cq_unlock_post_flush(struct io_ring_ctx *ctx)
 	 * it will re-check the wakeup conditions once we return we can safely
 	 * skip waking it up.
 	 */
-	if (!ctx->task_complete) {
+	if (!(ctx->flags & IORING_SETUP_SKIP_CQWAKE)) {
 		smp_mb();
 		__io_cqring_wake(ctx);
 	}
@@ -2519,18 +2519,6 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
 	if (!io_allowed_run_tw(ctx))
 		return -EEXIST;
 
-	do {
-		/* always run at least 1 task work to process local work */
-		ret = io_run_task_work_ctx(ctx);
-		if (ret < 0)
-			return ret;
-		io_cqring_overflow_flush(ctx);
-
-		/* if user messes with these they will just get an early return */
-		if (__io_cqring_events_user(ctx) >= min_events)
-			return 0;
-	} while (ret > 0);
-
 	if (sig) {
 #ifdef CONFIG_COMPAT
 		if (in_compat_syscall())
@@ -3345,16 +3333,6 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
 			mutex_unlock(&ctx->uring_lock);
 			goto out;
 		}
-		if (flags & IORING_ENTER_GETEVENTS) {
-			if (ctx->syscall_iopoll)
-				goto iopoll_locked;
-			/*
-			 * Ignore errors, we'll soon call io_cqring_wait() and
-			 * it should handle ownership problems if any.
-			 */
-			if (ctx->flags & IORING_SETUP_DEFER_TASKRUN)
-				(void)io_run_local_work_locked(ctx);
-		}
 		mutex_unlock(&ctx->uring_lock);
 	}
 
@@ -3721,7 +3699,8 @@ static long io_uring_setup(u32 entries, struct io_uring_params __user *params)
 			IORING_SETUP_R_DISABLED | IORING_SETUP_SUBMIT_ALL |
 			IORING_SETUP_COOP_TASKRUN | IORING_SETUP_TASKRUN_FLAG |
 			IORING_SETUP_SQE128 | IORING_SETUP_CQE32 |
-			IORING_SETUP_SINGLE_ISSUER | IORING_SETUP_DEFER_TASKRUN))
+			IORING_SETUP_SINGLE_ISSUER | IORING_SETUP_DEFER_TASKRUN |
+			IORING_SETUP_SKIP_CQWAKE))
 		return -EINVAL;
 
 	return io_uring_create(entries, &p, params);
diff --git a/io_uring/nop.c b/io_uring/nop.c
index d956599a3c1b..77c686de3eb2 100644
--- a/io_uring/nop.c
+++ b/io_uring/nop.c
@@ -20,6 +20,6 @@ int io_nop_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
  */
 int io_nop(struct io_kiocb *req, unsigned int issue_flags)
 {
-	io_req_set_res(req, 0, 0);
-	return IOU_OK;
+	io_req_queue_tw_complete(req, 0);
+	return IOU_ISSUE_SKIP_COMPLETE;
 }
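As a rough sketch of what such a harness can look like (an assumption,
not Pavel's actual test; IORING_SETUP_SKIP_CQWAKE exists only in the
experimental diff above, so it is defined locally here), a batched
liburing NOP loop against a single-issuer, deferred-task_work ring:

/*
 * Hedged userspace sketch of a NOP throughput loop. Build with:
 *   gcc -O2 nop_bench.c -luring
 * On a kernel without the RFC applied, the unknown setup flag
 * will simply make queue_init fail with -EINVAL.
 */
#include <stdio.h>
#include <liburing.h>

#ifndef IORING_SETUP_SKIP_CQWAKE
#define IORING_SETUP_SKIP_CQWAKE	(1U << 14)	/* value from the RFC diff */
#endif

#define BATCH	32

int main(void)
{
	struct io_uring ring;
	unsigned long completed = 0;
	int i, ret;

	/* single-issuer + deferred task_work is the setup that implies
	 * ->task_complete in the kernel */
	ret = io_uring_queue_init(64, &ring,
				  IORING_SETUP_SINGLE_ISSUER |
				  IORING_SETUP_DEFER_TASKRUN |
				  IORING_SETUP_SKIP_CQWAKE);
	if (ret < 0) {
		fprintf(stderr, "queue_init: %d\n", ret);
		return 1;
	}

	while (completed < 10000000UL) {
		struct io_uring_cqe *cqe;

		for (i = 0; i < BATCH; i++) {
			struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

			io_uring_prep_nop(sqe);
		}
		/* submit and block until the whole batch has completed */
		ret = io_uring_submit_and_wait(&ring, BATCH);
		if (ret < 0)
			break;
		while (!io_uring_peek_cqe(&ring, &cqe)) {
			io_uring_cqe_seen(&ring, cqe);
			completed++;
		}
	}
	io_uring_queue_exit(&ring);
	printf("completed %lu nops\n", completed);
	return 0;
}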
On 12/20/22 11:06 AM, Pavel Begunkov wrote:
> On 12/20/22 17:58, Pavel Begunkov wrote:
>> NOT FOR INCLUSION, needs some ring poll workarounds
>>
>> Flushing completions is done either from the submit syscall or by
>> task_work; both run in the context of the submitter task, and for
>> single-threaded rings, as implied by ->task_complete, there won't be
>> any waiters on ->cq_wait but the master task. That means that there
>> can be no tasks sleeping on cq_wait while we run
>> __io_submit_flush_completions(), and so waking up can be skipped.
>
> Not trivial to benchmark, as we need something to emulate a task_work
> arriving in the middle of waiting. I used the diff below to complete
> nops in tw and removed the preliminary tw runs for the "in the middle
> of waiting" part. IORING_SETUP_SKIP_CQWAKE controls whether we use the
> optimisation or not.
>
> It gets around 15% more IOPS (6769526 -> 7803304), which correlates
> with the 10% of wakeup cost seen in profiles. Another interesting part
> is that waitqueues are excessive for our purposes and we can replace
> cq_wait with something lighter, e.g. an atomic bit.

I was thinking something like that the other day; for most purposes
the wait infra is too heavy-handed for our case. If we exclude poll
for a second, everything else is internal and e.g. doesn't need IRQ-safe
locking at all. That's just one part of it. But I didn't have a good
idea for the poll() side of things, which would be required to make
some progress there.
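As a hedged illustration of the "atomic bit" idea being floated here
(a userspace stand-in, not io_uring code; cq_state, cq_ready() and the
futex helpers are invented for the example), the waiter announces
itself with one bit, so the completion side pays only an atomic load
when nobody is sleeping:

#include <stdatomic.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/futex.h>

#define CQ_WAITER	1u

static atomic_uint cq_state;	/* stand-in for ctx->cq_wait */

static void futex_op(atomic_uint *addr, int op, unsigned int val)
{
	syscall(SYS_futex, addr, op, val, NULL, NULL, 0);
}

/* waiter side: announce, re-check the condition, then sleep */
static void cq_wait(int (*cq_ready)(void))
{
	while (!cq_ready()) {
		atomic_fetch_or(&cq_state, CQ_WAITER);
		if (cq_ready()) {	/* re-check after setting the bit */
			atomic_fetch_and(&cq_state, ~CQ_WAITER);
			break;
		}
		/* sleeps only while the word still reads CQ_WAITER */
		futex_op(&cq_state, FUTEX_WAIT, CQ_WAITER);
	}
}

/* completion side: wake only if someone is definitely sleeping */
static void cq_wake(void)
{
	if (atomic_load(&cq_state) & CQ_WAITER) {
		atomic_fetch_and(&cq_state, ~CQ_WAITER);
		futex_op(&cq_state, FUTEX_WAKE, 1);
	}
}

The completer's fast path is then a single load rather than an IRQ-safe
waitqueue lock plus list walk, which is presumably where the extra ~5%
mentioned below would come from.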
On 12/20/22 18:10, Jens Axboe wrote:
> On 12/20/22 11:06 AM, Pavel Begunkov wrote:
>> On 12/20/22 17:58, Pavel Begunkov wrote:
>>> NOT FOR INCLUSION, needs some ring poll workarounds
>>>
>>> Flushing completions is done either from the submit syscall or by
>>> task_work; both run in the context of the submitter task, and for
>>> single-threaded rings, as implied by ->task_complete, there won't be
>>> any waiters on ->cq_wait but the master task. That means that there
>>> can be no tasks sleeping on cq_wait while we run
>>> __io_submit_flush_completions(), and so waking up can be skipped.
>>
>> Not trivial to benchmark, as we need something to emulate a task_work
>> arriving in the middle of waiting. I used the diff below to complete
>> nops in tw and removed the preliminary tw runs for the "in the middle
>> of waiting" part. IORING_SETUP_SKIP_CQWAKE controls whether we use the
>> optimisation or not.
>>
>> It gets around 15% more IOPS (6769526 -> 7803304), which correlates
>> with the 10% of wakeup cost seen in profiles. Another interesting part
>> is that waitqueues are excessive for our purposes and we can replace
>> cq_wait with something lighter, e.g. an atomic bit.
>
> I was thinking something like that the other day; for most purposes
> the wait infra is too heavy-handed for our case. If we exclude poll
> for a second, everything else is internal and e.g. doesn't need IRQ-safe
> locking at all. That's just one part of it. But I didn't have

Ring polling? We can move it to a separate waitqueue, probably with
some tricks to remove extra ifs from the hot path, which I'm planning
to add in v2.

> a good idea for the poll() side of things, which would be required
> to make some progress there.

I'll play with replacing waitqueues with bitops; that should save an
extra ~5% with the benchmark I used.
On 12/20/22 12:12 PM, Pavel Begunkov wrote:
> On 12/20/22 18:10, Jens Axboe wrote:
>> On 12/20/22 11:06 AM, Pavel Begunkov wrote:
>>> On 12/20/22 17:58, Pavel Begunkov wrote:
>>>> NOT FOR INCLUSION, needs some ring poll workarounds
>>>>
>>>> Flushing completions is done either from the submit syscall or by
>>>> task_work; both run in the context of the submitter task, and for
>>>> single-threaded rings, as implied by ->task_complete, there won't be
>>>> any waiters on ->cq_wait but the master task. That means that there
>>>> can be no tasks sleeping on cq_wait while we run
>>>> __io_submit_flush_completions(), and so waking up can be skipped.
>>>
>>> Not trivial to benchmark, as we need something to emulate a task_work
>>> arriving in the middle of waiting. I used the diff below to complete
>>> nops in tw and removed the preliminary tw runs for the "in the middle
>>> of waiting" part. IORING_SETUP_SKIP_CQWAKE controls whether we use the
>>> optimisation or not.
>>>
>>> It gets around 15% more IOPS (6769526 -> 7803304), which correlates
>>> with the 10% of wakeup cost seen in profiles. Another interesting part
>>> is that waitqueues are excessive for our purposes and we can replace
>>> cq_wait with something lighter, e.g. an atomic bit.
>>
>> I was thinking something like that the other day; for most purposes
>> the wait infra is too heavy-handed for our case. If we exclude poll
>> for a second, everything else is internal and e.g. doesn't need IRQ-safe
>> locking at all. That's just one part of it. But I didn't have
>
> Ring polling? We can move it to a separate waitqueue, probably with
> some tricks to remove extra ifs from the hot path, which I'm planning
> to add in v2.

Yes, polling on the ring itself. And that was my thinking too: leave
cq_wait just for that and then hide it behind <something something> to
make it hopefully almost free when the ring isn't polled. I just hadn't
put any thought into what exactly that'd look like yet.

>> a good idea for the poll() side of things, which would be required
>> to make some progress there.
>
> I'll play with replacing waitqueues with bitops; that should save an
> extra ~5% with the benchmark I used.

Excellent, looking forward to seeing that.
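For context on the "ring poll" caveat threaded through this exchange:
a second task may poll(2) the ring fd itself and thereby sleep on
ctx->cq_wait, which is exactly the waiter the RFC's skipped wakeup
would strand. A minimal userspace sketch of that usage (illustrative,
not taken from the thread):

/* Waiting for CQEs from a *different* thread by polling the ring fd.
 * This sleeps on the ring's waitqueue, so a kernel that skips the
 * cq_wait wakeup entirely would never rouse this poller. */
#include <poll.h>
#include <liburing.h>

static int wait_for_cqes(struct io_uring *ring)
{
	struct pollfd pfd = {
		.fd = ring->ring_fd,	/* the io_uring fd itself */
		.events = POLLIN,	/* POLLIN <=> CQEs are available */
	};

	return poll(&pfd, 1, -1);	/* block until a completion is posted */
}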
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 16a323a9ff70..a57b9008807c 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -618,6 +618,25 @@ static inline void __io_cq_unlock_post(struct io_ring_ctx *ctx)
 	io_cqring_wake(ctx);
 }
 
+static inline void __io_cq_unlock_post_flush(struct io_ring_ctx *ctx)
+	__releases(ctx->completion_lock)
+{
+	io_commit_cqring(ctx);
+	__io_cq_unlock(ctx);
+	io_commit_cqring_flush(ctx);
+
+	/*
+	 * As ->task_complete implies that the ring is single tasked, cq_wait
+	 * may only be waited on by the current in io_cqring_wait(), but since
+	 * it will re-check the wakeup conditions once we return we can safely
+	 * skip waking it up.
+	 */
+	if (!ctx->task_complete) {
+		smp_mb();
+		__io_cqring_wake(ctx);
+	}
+}
+
 void io_cq_unlock_post(struct io_ring_ctx *ctx)
 	__releases(ctx->completion_lock)
 {
@@ -1458,7 +1477,7 @@ static void __io_submit_flush_completions(struct io_ring_ctx *ctx)
 			}
 		}
 	}
-	__io_cq_unlock_post(ctx);
+	__io_cq_unlock_post_flush(ctx);
 
 	if (!wq_list_empty(&ctx->submit_state.compl_reqs)) {
 		io_free_batch_list(ctx, state->compl_reqs.first);
NOT FOR INCLUSION, needs some ring poll workarounds

Flushing completions is done either from the submit syscall or by
task_work; both run in the context of the submitter task, and for
single-threaded rings, as implied by ->task_complete, there won't be
any waiters on ->cq_wait but the master task. That means that there
can be no tasks sleeping on cq_wait while we run
__io_submit_flush_completions(), and so waking up can be skipped.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 io_uring/io_uring.c | 21 ++++++++++++++++++++-
 1 file changed, 20 insertions(+), 1 deletion(-)