
[1/3] io_uring: Add REQ_F_CQE_SKIP support for io_uring zerocopy

Message ID b1a047a1b2d55c1c245a78ca9772c31a9b3ceb12.1712268605.git.ozlinuxc@gmail.com (mailing list archive)
State New
Series Add REQ_F_CQE_SKIP support to io_uring zerocopy

Commit Message

Oliver Crumrine April 4, 2024, 10:17 p.m. UTC
In his patch to enable zerocopy networking for io_uring, Pavel Begunkov
specifically disabled REQ_F_CQE_SKIP, as (at least from my
understanding) the userspace program wouldn't receive the
IORING_CQE_F_MORE flag in the result value.

To fix this, instead of keeping track of how many CQEs have been
received, and subtracting notifs from that, programs can keep track of
how many SQEs they have issued, and if a CQE is returned with an error,
they can simply subtract from how many notifs they expect to receive.

Signed-off-by: Oliver Crumrine <ozlinuxc@gmail.com>
---
 io_uring/net.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)
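
In sketch form, the accounting described above might look like this (a
minimal illustration, not part of the patch; the counter and the CQE
handling around it are hypothetical):

	/* when submitting a SEND[MSG]_ZC SQE with IOSQE_CQE_SKIP_SUCCESS */
	expected_notifs++;

	/* when reaping a CQE */
	if (cqe->res < 0)
		expected_notifs--;	/* errored send: per the scheme above,
					 * one fewer notif to wait for */
	else if (cqe->flags & IORING_CQE_F_NOTIF)
		expected_notifs--;	/* notification: buffer can be reused */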

Comments

Pavel Begunkov April 5, 2024, 1:01 p.m. UTC | #1
On 4/4/24 23:17, Oliver Crumrine wrote:
> In his patch to enable zerocopy networking for io_uring, Pavel Begunkov
> specifically disabled REQ_F_CQE_SKIP, as (at least from my
> understanding) the userspace program wouldn't receive the
> IORING_CQE_F_MORE flag in the result value.

No. IORING_CQE_F_MORE means there will be another CQE from this
request, so a single CQE without IORING_CQE_F_MORE is trivially
fine.

The problem is the semantics, because by suppressing the first
CQE you're losing the result value. You might rely on WAITALL
as other sends and "fail" (in terms of io_uring) the request
in case of a partial send posting 2 CQEs, but that's not a great
way and it's getting userspace complicated pretty easily.

In short, it was left out for later because there is a
better way to implement it, but it should be done carefully


> To fix this, instead of keeping track of how many CQEs have been
> received, and subtracting notifs from that, programs can keep track of

That's a benchmark way of doing it; more realistically
it'd be more like

event_loop() {
	cqe = wait_cqe();
	struct req *r = (struct req *)cqe->user_data;
	r->callback(r, cqe);
}

send_zc_callback(req, cqe) {
	if (cqe->flags & F_MORE) {
		// don't free the req
		// we should wait for another CQE
		...
	}
}

> how many SQEs they have issued, and if a CQE is returned with an error,
> they can simply subtract from how many notifs they expect to receive.

The design specifically untangles those two notions, i.e. there can
be a notification even when the main CQE fails (ret<0). It's safer
this way, even though AFAIK relying on errors would be fine with
current users (TCP/UDP).
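
In sketch form, following the callback style above (a minimal
illustration):

	send_zc_callback(req, cqe) {
		/* decide based on F_MORE, not on the sign of cqe->res:
		 * a notification may still follow a failed send */
		if (cqe->flags & F_MORE) {
			// keep req, wait for the notification CQE
			...
		} else {
			// nothing else is coming, req can be freed
			...
		}
	}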


> Signed-off-by: Oliver Crumrine <ozlinuxc@gmail.com>
> ---
>   io_uring/net.c | 6 ++----
>   1 file changed, 2 insertions(+), 4 deletions(-)
> 
> diff --git a/io_uring/net.c b/io_uring/net.c
> index 1e7665ff6ef7..822f49809b68 100644
> --- a/io_uring/net.c
> +++ b/io_uring/net.c
> @@ -1044,9 +1044,6 @@ int io_send_zc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>   
>   	if (unlikely(READ_ONCE(sqe->__pad2[0]) || READ_ONCE(sqe->addr3)))
>   		return -EINVAL;
> -	/* we don't support IOSQE_CQE_SKIP_SUCCESS just yet */
> -	if (req->flags & REQ_F_CQE_SKIP)
> -		return -EINVAL;
>   
>   	notif = zc->notif = io_alloc_notif(ctx);
>   	if (!notif)
> @@ -1342,7 +1339,8 @@ void io_sendrecv_fail(struct io_kiocb *req)
>   		req->cqe.res = sr->done_io;
>   
>   	if ((req->flags & REQ_F_NEED_CLEANUP) &&
> -	    (req->opcode == IORING_OP_SEND_ZC || req->opcode == IORING_OP_SENDMSG_ZC))
> +	    (req->opcode == IORING_OP_SEND_ZC || req->opcode == IORING_OP_SENDMSG_ZC) &&
> +	    !(req->flags & REQ_F_CQE_SKIP))
>   		req->cqe.flags |= IORING_CQE_F_MORE;
>   }
>
Oliver Crumrine April 5, 2024, 8:04 p.m. UTC | #2
Pavel Begunkov wrote:
> On 4/4/24 23:17, Oliver Crumrine wrote:
> > In his patch to enable zerocopy networking for io_uring, Pavel Begunkov
> > specifically disabled REQ_F_CQE_SKIP, as (at least from my
> > understanding) the userspace program wouldn't receive the
> > IORING_CQE_F_MORE flag in the result value.
>
> No. IORING_CQE_F_MORE means there will be another CQE from this
> request, so a single CQE without IORING_CQE_F_MORE is trivially
> fine.
>
> The problem is the semantics, because by suppressing the first
> CQE you're losing the result value. You might rely on WAITALL
That's already happening with io_send.
> as other sends and "fail" (in terms of io_uring) the request
> in case of a partial send posting 2 CQEs, but that's not a great
> way and it's getting userspace complicated pretty easily.
>
> In short, it was left out for later because there is a
> better way to implement it, but it should be done carefully
Maybe we could put the return values in the notifs? That would be a
discrepancy between io_send and io_send_zc, though.
>
>
> > To fix this, instead of keeping track of how many CQEs have been
> > received, and subtracting notifs from that, programs can keep track of
>
> That's a benchmark way of doing it; more realistically
> it'd be more like
>
> event_loop() {
> 	cqe = wait_cqe();
> 	struct req *r = (struct req *)cqe->user_data;
> 	r->callback(r, cqe);
> }
>
> send_zc_callback(req, cqe) {
> 	if (cqe->flags & F_MORE) {
> 		// don't free the req
> 		// we should wait for another CQE
> 		...
> 	}
> }
>
> > how many SQEs they have issued, and if a CQE is returned with an error,
> > they can simply subtract from how many notifs they expect to receive.
>
> The design specifically untangles those two notions, i.e. there can
> be a notification even when the main CQE fails (ret<0). It's safer
> this way, even though AFAIK relying on errors would be fine with
> current users (TCP/UDP).
>
>
> > Signed-off-by: Oliver Crumrine <ozlinuxc@gmail.com>
> > ---
> >   io_uring/net.c | 6 ++----
> >   1 file changed, 2 insertions(+), 4 deletions(-)
> >
> > diff --git a/io_uring/net.c b/io_uring/net.c
> > index 1e7665ff6ef7..822f49809b68 100644
> > --- a/io_uring/net.c
> > +++ b/io_uring/net.c
> > @@ -1044,9 +1044,6 @@ int io_send_zc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
> >
> >   	if (unlikely(READ_ONCE(sqe->__pad2[0]) || READ_ONCE(sqe->addr3)))
> >   		return -EINVAL;
> > -	/* we don't support IOSQE_CQE_SKIP_SUCCESS just yet */
> > -	if (req->flags & REQ_F_CQE_SKIP)
> > -		return -EINVAL;
> >
> >   	notif = zc->notif = io_alloc_notif(ctx);
> >   	if (!notif)
> > @@ -1342,7 +1339,8 @@ void io_sendrecv_fail(struct io_kiocb *req)
> >   		req->cqe.res = sr->done_io;
> >
> >   	if ((req->flags & REQ_F_NEED_CLEANUP) &&
> > -	    (req->opcode == IORING_OP_SEND_ZC || req->opcode == IORING_OP_SENDMSG_ZC))
> > +	    (req->opcode == IORING_OP_SEND_ZC || req->opcode == IORING_OP_SENDMSG_ZC) &&
> > +	    !(req->flags & REQ_F_CQE_SKIP))
> >   		req->cqe.flags |= IORING_CQE_F_MORE;
> >   }
> >
>
> --
> Pavel Begunkov
Pavel Begunkov April 6, 2024, 9:23 p.m. UTC | #3
On 4/5/24 21:04, Oliver Crumrine wrote:
> Pavel Begunkov wrote:
>> On 4/4/24 23:17, Oliver Crumrine wrote:
>>> In his patch to enable zerocopy networking for io_uring, Pavel Begunkov
>>> specifically disabled REQ_F_CQE_SKIP, as (at least from my
>>> understanding) the userspace program wouldn't receive the
>>> IORING_CQE_F_MORE flag in the result value.
>>
>> No. IORING_CQE_F_MORE means there will be another CQE from this
>> request, so a single CQE without IORING_CQE_F_MORE is trivially
>> fine.
>>
>> The problem is the semantics, because by suppressing the first
>> CQE you're losing the result value. You might rely on WAITALL
> That's already happening with io_send.

Right, and it's still annoying and hard to use.

>> as other sends and "fail" (in terms of io_uring) the request
>> in case of a partial send posting 2 CQEs, but that's not a great
>> way and it's getting userspace complicated pretty easily.
>>
>> In short, it was left out for later because there is a
>> better way to implement it, but it should be done carefully
> Maybe we could put the return values in the notifs? That would be a
> discrepancy between io_send and io_send_zc, though.

Yes. And yes, having a custom flavour is not good. It'd only
be well usable if, apart from returning the actual result,
it also guarantees there will be one and only one CQE; then
the userspace doesn't have to do the dancing with counting
and checking F_MORE. In fact, I outlined before how a generic
solution might look:

https://github.com/axboe/liburing/issues/824

The only interesting part, IMHO, is to be able to merge the
main completion with its notification. Below is an old stash
rebased onto for-6.10. The only thing missing is relinking,
but maybe we don't even care about it. I need to cover it
well with tests.




commit ca5e4fb6d105b5dfdf3768d46ce01529b7bb88c5
Author: Pavel Begunkov <asml.silence@gmail.com>
Date:   Sat Apr 6 15:46:38 2024 +0100

     io_uring/net: introduce single CQE send zc mode
     
     IORING_OP_SEND[MSG]_ZC requests post two completions: one to
     notify that the data was queued, and later a second, usually referred
     to as the "notification", to let the user know that the buffer used can
     be reused/freed. In some cases the user might not care about the main
     completion and would be content getting only the notification, which
     would allow simplifying the userspace.
     
     One example is when, after a send, the user waits for the other
     end to get the message and reply back, not pushing any more data in the
     meantime. Another case is unreliable protocols like UDP, which do not
     require a confirmation from the other end before dropping buffers, and
     so the notifications are usually posted shortly after the send request
     is queued.
     
     Add a flag merging completions into a single CQE. cqe->res will store
     the send's result as usual, and it will have IORING_CQE_F_NOTIF set if
     the buffer was potentially used. Timewise, it would be posted at the
     moment when the notification would have been originally completed.
     
     Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>

diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 7bd10201a02b..e2b528c341c9 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -356,6 +356,7 @@ enum io_uring_op {
  #define IORING_RECV_MULTISHOT		(1U << 1)
  #define IORING_RECVSEND_FIXED_BUF	(1U << 2)
  #define IORING_SEND_ZC_REPORT_USAGE	(1U << 3)
+#define IORING_SEND_ZC_COMBINE_CQE	(1U << 4)
  
  /*
   * cqe.res for IORING_CQE_F_NOTIF if
diff --git a/io_uring/net.c b/io_uring/net.c
index a74287692071..052f030ab8f8 100644
--- a/io_uring/net.c
+++ b/io_uring/net.c
@@ -992,7 +992,19 @@ void io_send_zc_cleanup(struct io_kiocb *req)
  	}
  }
  
-#define IO_ZC_FLAGS_COMMON (IORING_RECVSEND_POLL_FIRST | IORING_RECVSEND_FIXED_BUF)
+static inline void io_sendzc_adjust_res(struct io_kiocb *req)
+{
+	struct io_sr_msg *sr = io_kiocb_to_cmd(req, struct io_sr_msg);
+
+	if (sr->flags & IORING_SEND_ZC_COMBINE_CQE) {
+		sr->notif->cqe.res = req->cqe.res;
+		req->flags |= REQ_F_CQE_SKIP;
+	}
+}
+
+#define IO_ZC_FLAGS_COMMON (IORING_RECVSEND_POLL_FIRST | \
+			    IORING_RECVSEND_FIXED_BUF | \
+			    IORING_SEND_ZC_COMBINE_CQE)
  #define IO_ZC_FLAGS_VALID  (IO_ZC_FLAGS_COMMON | IORING_SEND_ZC_REPORT_USAGE)
  
  int io_send_zc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
@@ -1022,6 +1034,8 @@ int io_send_zc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
  		if (zc->flags & ~IO_ZC_FLAGS_VALID)
  			return -EINVAL;
  		if (zc->flags & IORING_SEND_ZC_REPORT_USAGE) {
+			if (zc->flags & IORING_SEND_ZC_COMBINE_CQE)
+				return -EINVAL;
  			io_notif_set_extended(notif);
  			io_notif_to_data(notif)->zc_report = true;
  		}
@@ -1197,6 +1211,9 @@ int io_send_zc(struct io_kiocb *req, unsigned int issue_flags)
  	else if (zc->done_io)
  		ret = zc->done_io;
  
+	io_req_set_res(req, ret, IORING_CQE_F_MORE);
+	io_sendzc_adjust_res(req);
+
  	/*
  	 * If we're in io-wq we can't rely on tw ordering guarantees, defer
  	 * flushing notif to io_send_zc_cleanup()
@@ -1205,7 +1222,6 @@ int io_send_zc(struct io_kiocb *req, unsigned int issue_flags)
  		io_notif_flush(zc->notif);
  		io_req_msg_cleanup(req, 0);
  	}
-	io_req_set_res(req, ret, IORING_CQE_F_MORE);
  	return IOU_OK;
  }
  
@@ -1258,6 +1274,9 @@ int io_sendmsg_zc(struct io_kiocb *req, unsigned int issue_flags)
  	else if (sr->done_io)
  		ret = sr->done_io;
  
+	io_req_set_res(req, ret, IORING_CQE_F_MORE);
+	io_sendzc_adjust_res(req);
+
  	/*
  	 * If we're in io-wq we can't rely on tw ordering guarantees, defer
  	 * flushing notif to io_send_zc_cleanup()
@@ -1266,7 +1285,6 @@ int io_sendmsg_zc(struct io_kiocb *req, unsigned int issue_flags)
  		io_notif_flush(sr->notif);
  		io_req_msg_cleanup(req, 0);
  	}
-	io_req_set_res(req, ret, IORING_CQE_F_MORE);
  	return IOU_OK;
  }
  
@@ -1278,8 +1296,10 @@ void io_sendrecv_fail(struct io_kiocb *req)
  		req->cqe.res = sr->done_io;
  
  	if ((req->flags & REQ_F_NEED_CLEANUP) &&
-	    (req->opcode == IORING_OP_SEND_ZC || req->opcode == IORING_OP_SENDMSG_ZC))
+	    (req->opcode == IORING_OP_SEND_ZC || req->opcode == IORING_OP_SENDMSG_ZC)) {
  		req->cqe.flags |= IORING_CQE_F_MORE;
+		io_sendzc_adjust_res(req);
+	}
  }
  
  int io_accept_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
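
For illustration, a userspace sketch of how the combined-CQE mode above
might be consumed (hypothetical: IORING_SEND_ZC_COMBINE_CQE only exists
in the stash above, not upstream; handle_error()/reuse_buffer() are
placeholders, the rest is standard liburing):

	struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

	/* zerocopy send whose result and notification are merged into one CQE */
	io_uring_prep_send_zc(sqe, sockfd, buf, len, 0, IORING_SEND_ZC_COMBINE_CQE);
	io_uring_sqe_set_data(sqe, buf);
	io_uring_submit(&ring);

	struct io_uring_cqe *cqe;
	io_uring_wait_cqe(&ring, &cqe);
	if (cqe->res < 0)
		handle_error(cqe->res);			/* the send failed */
	else if (cqe->flags & IORING_CQE_F_NOTIF)
		reuse_buffer(io_uring_cqe_get_data(cqe));	/* buffer was used and is released */
	io_uring_cqe_seen(&ring, cqe);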

Patch

diff --git a/io_uring/net.c b/io_uring/net.c
index 1e7665ff6ef7..822f49809b68 100644
--- a/io_uring/net.c
+++ b/io_uring/net.c
@@ -1044,9 +1044,6 @@  int io_send_zc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 
 	if (unlikely(READ_ONCE(sqe->__pad2[0]) || READ_ONCE(sqe->addr3)))
 		return -EINVAL;
-	/* we don't support IOSQE_CQE_SKIP_SUCCESS just yet */
-	if (req->flags & REQ_F_CQE_SKIP)
-		return -EINVAL;
 
 	notif = zc->notif = io_alloc_notif(ctx);
 	if (!notif)
@@ -1342,7 +1339,8 @@  void io_sendrecv_fail(struct io_kiocb *req)
 		req->cqe.res = sr->done_io;
 
 	if ((req->flags & REQ_F_NEED_CLEANUP) &&
-	    (req->opcode == IORING_OP_SEND_ZC || req->opcode == IORING_OP_SENDMSG_ZC))
+	    (req->opcode == IORING_OP_SEND_ZC || req->opcode == IORING_OP_SENDMSG_ZC) &&
+	    !(req->flags & REQ_F_CQE_SKIP))
 		req->cqe.flags |= IORING_CQE_F_MORE;
 }