
[for-next,5/7] io_uring: post msg_ring CQE in task context

Message ID bb0e9ee516e182802da798258f303bf22ecdc151.1670207706.git.asml.silence@gmail.com (mailing list archive)
State New
Series CQ locking optimisation

Commit Message

Pavel Begunkov Dec. 5, 2022, 2:44 a.m. UTC
We want to limit post_aux_cqe() to the task context when ->task_complete
is set, and so we can't just deliver an IORING_OP_MSG_RING CQE to another
thread. Instead of trying to invent a new delayed CQE posting mechanism,
push such CQEs into the overflow list.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 io_uring/io_uring.c | 12 ++++++++++++
 io_uring/io_uring.h |  2 ++
 io_uring/msg_ring.c | 14 ++++++++++++--
 3 files changed, 26 insertions(+), 2 deletions(-)
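
For context, IORING_OP_MSG_RING lets one ring post a CQE into another
ring's completion queue; this patch only changes how that CQE gets
delivered on the kernel side. A minimal userspace sketch of the feature
using liburing (assumes liburing 2.2+; the fds, user_data and res values
are made up, and both rings live in one process purely for illustration):

#include <liburing.h>
#include <stdio.h>

int main(void)
{
	struct io_uring src, dst;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;

	io_uring_queue_init(8, &src, 0);
	io_uring_queue_init(8, &dst, 0);

	/* ask the kernel to post a CQE with res=42, user_data=0xcafe into dst */
	sqe = io_uring_get_sqe(&src);
	io_uring_prep_msg_ring(sqe, dst.ring_fd, 42, 0xcafe, 0);
	io_uring_submit(&src);

	/* the message shows up as a completion on the target ring */
	io_uring_wait_cqe(&dst, &cqe);
	printf("dst CQE: user_data=%llu res=%d\n",
	       (unsigned long long)cqe->user_data, cqe->res);
	io_uring_cqe_seen(&dst, cqe);

	io_uring_queue_exit(&src);
	io_uring_queue_exit(&dst);
	return 0;
}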

Comments

Jens Axboe Dec. 5, 2022, 11:53 a.m. UTC | #1
On 12/4/22 7:44 PM, Pavel Begunkov wrote:
> We want to limit post_aux_cqe() to the task context when ->task_complete
> is set, and so we can't just deliver a IORING_OP_MSG_RING CQE to another
> thread. Instead of trying to invent a new delayed CQE posting mechanism
> push them into the overflow list.

This is really the only one out of the series that I'm not a big fan of.
If we always rely on overflow for msg_ring, then that basically removes
it from being usable in a higher performance setting.

The natural way to do this would be to post the cqe via task_work for
the target ring, but we also don't have any storage available for that.
Might still be better to alloc something a la

struct tw_cqe_post {
	struct task_work work;
	s32 res;
	u32 flags;
	u64 user_data;
}

and post it with that?
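
Spelled out as a rough, untested sketch (the struct layout follows the
suggestion above; the helper names, the callback_head/task_work_add
plumbing and the error handling are assumptions for illustration, not
code from a posted patch):

struct tw_cqe_post {
	struct callback_head work;	/* plain task_work entry */
	struct io_ring_ctx *ctx;	/* target ring */
	u64 user_data;
	s32 res;
	u32 flags;
};

static void io_tw_cqe_post(struct callback_head *cb)
{
	struct tw_cqe_post *p = container_of(cb, struct tw_cqe_post, work);

	/* runs in the target ring's submitter task, so posting is allowed;
	 * overflow handling is elided in this sketch */
	io_post_aux_cqe(p->ctx, p->user_data, p->res, p->flags);
	kfree(p);
}

static int io_queue_remote_cqe(struct io_ring_ctx *target, u64 user_data,
			       s32 res, u32 cflags)
{
	struct tw_cqe_post *p = kmalloc(sizeof(*p), GFP_KERNEL);

	if (!p)
		return -ENOMEM;
	p->ctx = target;
	p->user_data = user_data;
	p->res = res;
	p->flags = cflags;
	init_task_work(&p->work, io_tw_cqe_post);
	if (task_work_add(target->submitter_task, &p->work, TWA_SIGNAL)) {
		/* target task is already exiting */
		kfree(p);
		return -EOWNERDEAD;
	}
	return 0;
}
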
Dylan Yudaken Dec. 5, 2022, 3:12 p.m. UTC | #2
On Mon, 2022-12-05 at 04:53 -0700, Jens Axboe wrote:
> On 12/4/22 7:44 PM, Pavel Begunkov wrote:
> > We want to limit post_aux_cqe() to the task context when ->task_complete
> > is set, and so we can't just deliver a IORING_OP_MSG_RING CQE to
> > another
> > thread. Instead of trying to invent a new delayed CQE posting
> > mechanism
> > push them into the overflow list.
> 
> This is really the only one out of the series that I'm not a big fan
> of.
> If we always rely on overflow for msg_ring, then that basically
> removes
> it from being usable in a higher performance setting.
> 
> The natural way to do this would be to post the cqe via task_work for
> the target, ring, but we also don't any storage available for that.
> Might still be better to alloc something ala
> 
> struct tw_cqe_post {
>         struct task_work work;
>         s32 res;
>         u32 flags;
>         u64 user_data;
> }
> 
> and post it with that?
> 

It might work to post the whole request to the target, post the cqe,
and then return the request back to the originating ring via tw for the
msg_ring CQE and cleanup.
Jens Axboe Dec. 5, 2022, 3:18 p.m. UTC | #3
On 12/5/22 8:12 AM, Dylan Yudaken wrote:
> On Mon, 2022-12-05 at 04:53 -0700, Jens Axboe wrote:
>> On 12/4/22 7:44 PM, Pavel Begunkov wrote:
>>> We want to limit post_aux_cqe() to the task context when ->task_complete
>>> is set, and so we can't just deliver a IORING_OP_MSG_RING CQE to
>>> another
>>> thread. Instead of trying to invent a new delayed CQE posting
>>> mechanism
>>> push them into the overflow list.
>>
>> This is really the only one out of the series that I'm not a big fan
>> of.
>> If we always rely on overflow for msg_ring, then that basically
>> removes
>> it from being usable in a higher performance setting.
>>
>> The natural way to do this would be to post the cqe via task_work for
>> the target, ring, but we also don't any storage available for that.
>> Might still be better to alloc something ala
>>
>> struct tw_cqe_post {
>> 	struct task_work work;
>> 	s32 res;
>> 	u32 flags;
>> 	u64 user_data;
>> }
>>
>> and post it with that?
>>
> 
> It might work to post the whole request to the target, post the cqe,
> and then return the request back to the originating ring via tw for the
> msg_ring CQE and cleanup.

I did consider that, but then you need to ref that request as well as
bounce it twice via task_work. Probably easier to just alloc at that
point? Though if you do that, then the target cqe would post later than
the original. And potentially lose -EOVERFLOW if the target ring is
overflown...
Pavel Begunkov Dec. 6, 2022, 10:42 a.m. UTC | #4
On 12/5/22 15:18, Jens Axboe wrote:
> On 12/5/22 8:12 AM, Dylan Yudaken wrote:
>> On Mon, 2022-12-05 at 04:53 -0700, Jens Axboe wrote:
>>> On 12/4/22 7:44 PM, Pavel Begunkov wrote:
>>>> We want to limit post_aux_cqe() to the task context when ->task_complete
>>>> is set, and so we can't just deliver a IORING_OP_MSG_RING CQE to
>>>> another
>>>> thread. Instead of trying to invent a new delayed CQE posting
>>>> mechanism
>>>> push them into the overflow list.
>>>
>>> This is really the only one out of the series that I'm not a big fan
>>> of.
>>> If we always rely on overflow for msg_ring, then that basically
>>> removes
>>> it from being usable in a higher performance setting.
>>>
>>> The natural way to do this would be to post the cqe via task_work for
>>> the target, ring, but we also don't any storage available for that.
>>> Might still be better to alloc something ala
>>>
>>> struct tw_cqe_post {
>>> 	struct task_work work;
>>> 	s32 res;
>>> 	u32 flags;
>>> 	u64 user_data;
>>> }
>>>
>>> and post it with that?

What does it change performance-wise? I need to add a patch to
"try to flush before overflowing", but apart from that it's one
additional allocation in both cases, while the task_work approach also
adds raw / non-batched task_work.

>> It might work to post the whole request to the target, post the cqe,
>> and then return the request back to the originating ring via tw for the
>> msg_ring CQE and cleanup.
> 
> I did consider that, but then you need to ref that request as well as
> bounce it twice via task_work. Probably easier to just alloc at that
> point? Though if you do that, then the target cqe would post later than
> the original. And potentially lose -EOVERFLOW if the target ring is
> overflown...

Double tw is interesting for future plans, but yeah, I don't think
it makes that much of a difference in the context of this series.
Jens Axboe Dec. 6, 2022, 4:06 p.m. UTC | #5
On 12/6/22 3:42 AM, Pavel Begunkov wrote:
> On 12/5/22 15:18, Jens Axboe wrote:
>> On 12/5/22 8:12 AM, Dylan Yudaken wrote:
>>> On Mon, 2022-12-05 at 04:53 -0700, Jens Axboe wrote:
>>>> On 12/4/22 7:44 PM, Pavel Begunkov wrote:
>>>>> We want to limit post_aux_cqe() to the task context when ->task_complete
>>>>> is set, and so we can't just deliver a IORING_OP_MSG_RING CQE to
>>>>> another
>>>>> thread. Instead of trying to invent a new delayed CQE posting
>>>>> mechanism
>>>>> push them into the overflow list.
>>>>
>>>> This is really the only one out of the series that I'm not a big fan
>>>> of.
>>>> If we always rely on overflow for msg_ring, then that basically
>>>> removes
>>>> it from being usable in a higher performance setting.
>>>>
>>>> The natural way to do this would be to post the cqe via task_work for
>>>> the target, ring, but we also don't any storage available for that.
>>>> Might still be better to alloc something ala
>>>>
>>>> struct tw_cqe_post {
>>>> 	struct task_work work;
>>>> 	s32 res;
>>>> 	u32 flags;
>>>> 	u64 user_data;
>>>> }
>>>>
>>>> and post it with that?
> 
> What does it change performance wise? I need to add a patch to
> "try to flush before overflowing", but apart from that it's one
> additional allocation in both cases but adds additional
> raw / not-batch task_work.

It adds alloc+free for each one, and an overflow flush is needed on the
recipient side. It also adds a cq lock/unlock, though I don't think that
part will be a big deal.

>>> It might work to post the whole request to the target, post the cqe,
>>> and then return the request back to the originating ring via tw for the
>>> msg_ring CQE and cleanup.
>>
>> I did consider that, but then you need to ref that request as well as
>> bounce it twice via task_work. Probably easier to just alloc at that
>> point? Though if you do that, then the target cqe would post later than
>> the original. And potentially lose -EOVERFLOW if the target ring is
>> overflown...
> 
> Double tw is interesting for future plans, but yeah, I don't think
> it's so much of a difference in context of this series.

I did a half-assed patch for that... Below.


diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 36cb63e4174f..974eeaac313f 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -1254,10 +1254,10 @@ static void io_req_local_work_add(struct io_kiocb *req)
 	__io_cqring_wake(ctx);
 }
 
-void __io_req_task_work_add(struct io_kiocb *req, bool allow_local)
+void __io_req_task_work_add_ctx(struct io_ring_ctx *ctx, struct io_kiocb *req,
+				struct task_struct *task, bool allow_local)
 {
-	struct io_uring_task *tctx = req->task->io_uring;
-	struct io_ring_ctx *ctx = req->ctx;
+	struct io_uring_task *tctx = task->io_uring;
 
 	if (allow_local && ctx->flags & IORING_SETUP_DEFER_TASKRUN) {
 		io_req_local_work_add(req);
@@ -1277,6 +1277,11 @@ void __io_req_task_work_add(struct io_kiocb *req, bool allow_local)
 	io_fallback_tw(tctx);
 }
 
+void __io_req_task_work_add(struct io_kiocb *req, bool allow_local)
+{
+	__io_req_task_work_add_ctx(req->ctx, req, req->task, allow_local);
+}
+
 static void __cold io_move_task_work_from_local(struct io_ring_ctx *ctx)
 {
 	struct llist_node *node;
@@ -1865,7 +1870,7 @@ static int io_issue_sqe(struct io_kiocb *req, unsigned int issue_flags)
 		return ret;
 
 	/* If the op doesn't have a file, we're not polling for it */
-	if ((req->ctx->flags & IORING_SETUP_IOPOLL) && req->file)
+	if ((req->ctx->flags & IORING_SETUP_IOPOLL) && !def->noiopoll && req->file)
 		io_iopoll_req_issued(req, issue_flags);
 
 	return 0;
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index c20f15f5024d..3d24cba17504 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -51,6 +51,8 @@ static inline bool io_req_ffs_set(struct io_kiocb *req)
 }
 
 void __io_req_task_work_add(struct io_kiocb *req, bool allow_local);
+void __io_req_task_work_add_ctx(struct io_ring_ctx *ctx, struct io_kiocb *req,
+				struct task_struct *task, bool allow_local);
 bool io_is_uring_fops(struct file *file);
 bool io_alloc_async_data(struct io_kiocb *req);
 void io_req_task_queue(struct io_kiocb *req);
diff --git a/io_uring/msg_ring.c b/io_uring/msg_ring.c
index 7717fe519b07..fdc189b04d30 100644
--- a/io_uring/msg_ring.c
+++ b/io_uring/msg_ring.c
@@ -23,14 +23,41 @@ struct io_msg {
 	u32 flags;
 };
 
+static void io_msg_cqe_post(struct io_kiocb *req, bool *locked)
+{
+	struct io_msg *msg = io_kiocb_to_cmd(req, struct io_msg);
+	struct io_ring_ctx *ctx = req->file->private_data;
+	unsigned issue_flags = 0;
+	int ret = 0;
+
+	if (!io_post_aux_cqe(ctx, msg->user_data, msg->len, msg->flags))
+		ret = -EOVERFLOW;
+
+	io_req_set_res(req, ret, 0);
+	if (!*locked)
+		issue_flags = IO_URING_F_UNLOCKED;
+	io_req_complete_post(req, issue_flags);
+}
+
+static int io_msg_post_remote(struct io_ring_ctx *ctx, struct io_kiocb *req)
+{
+	req->io_task_work.func = io_msg_cqe_post;
+	__io_req_task_work_add_ctx(ctx, req, ctx->submitter_task, true);
+	return IOU_ISSUE_SKIP_COMPLETE;
+}
+
 /* post cqes to another ring */
-static int io_msg_post_cqe(struct io_ring_ctx *ctx,
-			   u64 user_data, s32 res, u32 cflags)
+static int io_msg_post_cqe(struct io_ring_ctx *ctx, struct io_kiocb *req)
 {
-	if (!ctx->task_complete || current == ctx->submitter_task)
-		return io_post_aux_cqe(ctx, user_data, res, cflags);
-	else
-		return io_post_aux_cqe_overflow(ctx, user_data, res, cflags);
+	struct io_msg *msg = io_kiocb_to_cmd(req, struct io_msg);
+
+	if (!ctx->task_complete || current == ctx->submitter_task) {
+		if (io_post_aux_cqe(ctx, msg->user_data, msg->len, msg->flags))
+			return 0;
+		return -EOVERFLOW;
+	}
+
+	return io_msg_post_remote(ctx, req);
 }
 
 static int io_msg_ring_data(struct io_kiocb *req)
@@ -41,10 +68,7 @@ static int io_msg_ring_data(struct io_kiocb *req)
 	if (msg->src_fd || msg->dst_fd || msg->flags)
 		return -EINVAL;
 
-	if (io_msg_post_cqe(target_ctx, msg->user_data, msg->len, 0))
-		return 0;
-
-	return -EOVERFLOW;
+	return io_msg_post_cqe(target_ctx, req);
 }
 
 static void io_double_unlock_ctx(struct io_ring_ctx *ctx,
@@ -126,8 +150,7 @@ static int io_msg_send_fd(struct io_kiocb *req, unsigned int issue_flags)
 	 * completes with -EOVERFLOW, then the sender must ensure that a
 	 * later IORING_OP_MSG_RING delivers the message.
 	 */
-	if (!io_msg_post_cqe(target_ctx, msg->user_data, msg->len, 0))
-		ret = -EOVERFLOW;
+	ret = io_msg_post_cqe(target_ctx, req);
 out_unlock:
 	io_double_unlock_ctx(ctx, target_ctx, issue_flags);
 	return ret;
@@ -173,13 +196,11 @@ int io_msg_ring(struct io_kiocb *req, unsigned int issue_flags)
 		break;
 	}
 
+	if (ret == IOU_ISSUE_SKIP_COMPLETE)
+		return IOU_ISSUE_SKIP_COMPLETE;
 done:
 	if (ret < 0)
 		req_set_fail(req);
 	io_req_set_res(req, ret, 0);
-	/* put file to avoid an attempt to IOPOLL the req */
-	if (!(req->flags & REQ_F_FIXED_FILE))
-		io_put_file(req->file);
-	req->file = NULL;
 	return IOU_OK;
 }
diff --git a/io_uring/opdef.c b/io_uring/opdef.c
index 83dc0f9ad3b2..638df83895fb 100644
--- a/io_uring/opdef.c
+++ b/io_uring/opdef.c
@@ -436,6 +436,7 @@ const struct io_op_def io_op_defs[] = {
 	[IORING_OP_MSG_RING] = {
 		.needs_file		= 1,
 		.iopoll			= 1,
+		.noiopoll		= 1,
 		.name			= "MSG_RING",
 		.prep			= io_msg_ring_prep,
 		.issue			= io_msg_ring,
diff --git a/io_uring/opdef.h b/io_uring/opdef.h
index 3efe06d25473..e378eb240538 100644
--- a/io_uring/opdef.h
+++ b/io_uring/opdef.h
@@ -25,6 +25,8 @@ struct io_op_def {
 	unsigned		ioprio : 1;
 	/* supports iopoll */
 	unsigned		iopoll : 1;
+	/* don't iopoll for this request */
+	unsigned		noiopoll : 1;
 	/* opcode specific path will handle ->async_data allocation if needed */
 	unsigned		manual_alloc : 1;
 	/* size of async data needed, if any */
Pavel Begunkov Dec. 7, 2022, 3:59 a.m. UTC | #6
On 12/6/22 16:06, Jens Axboe wrote:
> On 12/6/22 3:42 AM, Pavel Begunkov wrote:
>> On 12/5/22 15:18, Jens Axboe wrote:
>>> On 12/5/22 8:12 AM, Dylan Yudaken wrote:
>>>> On Mon, 2022-12-05 at 04:53 -0700, Jens Axboe wrote:
>>>>> On 12/4/22 7:44 PM, Pavel Begunkov wrote:
>>>>>> We want to limit post_aux_cqe() to the task context when ->task_complete
>>>>>> is set, and so we can't just deliver a IORING_OP_MSG_RING CQE to
>>>>>> another
>>>>>> thread. Instead of trying to invent a new delayed CQE posting
>>>>>> mechanism
>>>>>> push them into the overflow list.
>>>>>
>>>>> This is really the only one out of the series that I'm not a big fan
>>>>> of.
>>>>> If we always rely on overflow for msg_ring, then that basically
>>>>> removes
>>>>> it from being usable in a higher performance setting.
>>>>>
>>>>> The natural way to do this would be to post the cqe via task_work for
>>>>> the target, ring, but we also don't any storage available for that.
>>>>> Might still be better to alloc something ala
>>>>>
>>>>> struct tw_cqe_post {
>>>>> 	struct task_work work;
>>>>> 	s32 res;
>>>>> 	u32 flags;
>>>>> 	u64 user_data;
>>>>> }
>>>>>
>>>>> and post it with that?
>>
>> What does it change performance wise? I need to add a patch to
>> "try to flush before overflowing", but apart from that it's one
>> additional allocation in both cases but adds additional
>> raw / not-batch task_work.
> 
> It adds alloc+free for each one, and overflow flush needed on the
> recipient side. It also adds a cq lock/unlock, though I don't think that
> part will be a big deal.

And that approach below does 2 tw swings; neither is ideal, but
it feels like a bearable price for poking into another ring.

I sent a series with the double tw approach, which should be better for
CQ ordering; can you pick it up instead? I don't use the io_uring tw
infra of a ring the request doesn't belong to, as it seems to me
like shooting yourself in the leg.
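
For reference, the double tw flow could look roughly like the sketch
below; the callback_head member in struct io_msg ("tw") and the function
names are assumptions for illustration, not necessarily what the posted
series does:

/* hop 1: runs in the target ring's submitter task via plain task_work */
static void io_msg_tw_complete(struct callback_head *head)
{
	struct io_msg *msg = container_of(head, struct io_msg, tw);
	struct io_kiocb *req = cmd_to_io_kiocb(msg);
	struct io_ring_ctx *target_ctx = req->file->private_data;
	int ret = 0;

	if (!io_post_aux_cqe(target_ctx, msg->user_data, msg->len, 0))
		ret = -EOVERFLOW;

	/* hop 2: bounce the request back to its own ring to complete it */
	io_req_set_res(req, ret, 0);
	req->io_task_work.func = io_req_task_complete;
	io_req_task_work_add(req);
}

static int io_msg_post_remote(struct io_kiocb *req, struct io_ring_ctx *target_ctx)
{
	struct io_msg *msg = io_kiocb_to_cmd(req, struct io_msg);

	/* assumes struct io_msg grew a "struct callback_head tw" member */
	init_task_work(&msg->tw, io_msg_tw_complete);
	if (task_work_add(target_ctx->submitter_task, &msg->tw, TWA_SIGNAL))
		return -EOWNERDEAD;
	return IOU_ISSUE_SKIP_COMPLETE;
}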
Jens Axboe Dec. 7, 2022, 3:42 p.m. UTC | #7
On 12/6/22 8:59 PM, Pavel Begunkov wrote:
> On 12/6/22 16:06, Jens Axboe wrote:
>> On 12/6/22 3:42 AM, Pavel Begunkov wrote:
>>> On 12/5/22 15:18, Jens Axboe wrote:
>>>> On 12/5/22 8:12 AM, Dylan Yudaken wrote:
>>>>> On Mon, 2022-12-05 at 04:53 -0700, Jens Axboe wrote:
>>>>>> On 12/4/22 7:44 PM, Pavel Begunkov wrote:
>>>>>>> We want to limit post_aux_cqe() to the task context when ->task_complete
>>>>>>> is set, and so we can't just deliver a IORING_OP_MSG_RING CQE to
>>>>>>> another
>>>>>>> thread. Instead of trying to invent a new delayed CQE posting
>>>>>>> mechanism
>>>>>>> push them into the overflow list.
>>>>>>
>>>>>> This is really the only one out of the series that I'm not a big fan
>>>>>> of.
>>>>>> If we always rely on overflow for msg_ring, then that basically
>>>>>> removes
>>>>>> it from being usable in a higher performance setting.
>>>>>>
>>>>>> The natural way to do this would be to post the cqe via task_work for
>>>>>> the target, ring, but we also don't any storage available for that.
>>>>>> Might still be better to alloc something ala
>>>>>>
>>>>>> struct tw_cqe_post {
>>>>>> 	struct task_work work;
>>>>>> 	s32 res;
>>>>>> 	u32 flags;
>>>>>> 	u64 user_data;
>>>>>> }
>>>>>>
>>>>>> and post it with that?
>>>
>>> What does it change performance wise? I need to add a patch to
>>> "try to flush before overflowing", but apart from that it's one
>>> additional allocation in both cases but adds additional
>>> raw / not-batch task_work.
>>
>> It adds alloc+free for each one, and overflow flush needed on the
>> recipient side. It also adds a cq lock/unlock, though I don't think that
>> part will be a big deal.
> 
> And that approach below does 2 tw swings, neither is ideal but
> it feels like a bearable price for poking into another ring.
> 
> I sent a series with the double tw approach, should be better for
> CQ ordering, can you pick it up instead? I don't use io_uring tw
> infra of a ring the request doesn't belong to as it seems to me
> like shooting yourself in the leg.

Yeah, I think that's the right choice; it was just a quick hack on
my end to see if it was feasible. But it's not a good fit to use
our general tw infra for this.

Patch

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 0c86df7112fb..7fda57dc0e8c 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -860,6 +860,18 @@  bool io_post_aux_cqe(struct io_ring_ctx *ctx, u64 user_data, s32 res, u32 cflags
 	return __io_post_aux_cqe(ctx, user_data, res, cflags, true);
 }
 
+bool io_post_aux_cqe_overflow(struct io_ring_ctx *ctx,
+			      u64 user_data, s32 res, u32 cflags)
+{
+	bool filled;
+
+	io_cq_lock(ctx);
+	ctx->cq_extra++;
+	filled = io_cqring_event_overflow(ctx, user_data, res, cflags, 0, 0);
+	io_cq_unlock_post(ctx);
+	return filled;
+}
+
 bool io_aux_cqe(struct io_ring_ctx *ctx, bool defer, u64 user_data, s32 res, u32 cflags,
 		bool allow_overflow)
 {
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index 62227ec3260c..a0b11a631e29 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -36,6 +36,8 @@  bool io_post_aux_cqe(struct io_ring_ctx *ctx, u64 user_data, s32 res, u32 cflags
 bool io_aux_cqe(struct io_ring_ctx *ctx, bool defer, u64 user_data, s32 res, u32 cflags,
 		bool allow_overflow);
 void __io_commit_cqring_flush(struct io_ring_ctx *ctx);
+bool io_post_aux_cqe_overflow(struct io_ring_ctx *ctx,
+			      u64 user_data, s32 res, u32 cflags);
 
 struct page **io_pin_pages(unsigned long ubuf, unsigned long len, int *npages);
 
diff --git a/io_uring/msg_ring.c b/io_uring/msg_ring.c
index afb543aab9f6..7717fe519b07 100644
--- a/io_uring/msg_ring.c
+++ b/io_uring/msg_ring.c
@@ -23,6 +23,16 @@  struct io_msg {
 	u32 flags;
 };
 
+/* post cqes to another ring */
+static int io_msg_post_cqe(struct io_ring_ctx *ctx,
+			   u64 user_data, s32 res, u32 cflags)
+{
+	if (!ctx->task_complete || current == ctx->submitter_task)
+		return io_post_aux_cqe(ctx, user_data, res, cflags);
+	else
+		return io_post_aux_cqe_overflow(ctx, user_data, res, cflags);
+}
+
 static int io_msg_ring_data(struct io_kiocb *req)
 {
 	struct io_ring_ctx *target_ctx = req->file->private_data;
@@ -31,7 +41,7 @@  static int io_msg_ring_data(struct io_kiocb *req)
 	if (msg->src_fd || msg->dst_fd || msg->flags)
 		return -EINVAL;
 
-	if (io_post_aux_cqe(target_ctx, msg->user_data, msg->len, 0))
+	if (io_msg_post_cqe(target_ctx, msg->user_data, msg->len, 0))
 		return 0;
 
 	return -EOVERFLOW;
@@ -116,7 +126,7 @@  static int io_msg_send_fd(struct io_kiocb *req, unsigned int issue_flags)
 	 * completes with -EOVERFLOW, then the sender must ensure that a
 	 * later IORING_OP_MSG_RING delivers the message.
 	 */
-	if (!io_post_aux_cqe(target_ctx, msg->user_data, msg->len, 0))
+	if (!io_msg_post_cqe(target_ctx, msg->user_data, msg->len, 0))
 		ret = -EOVERFLOW;
 out_unlock:
 	io_double_unlock_ctx(ctx, target_ctx, issue_flags);