
IORING_CQE_F_COPIED

Message ID 4385ba84-55dd-6b08-0ca7-6b4a43f9d9a2@samba.org (mailing list archive)
State New
Series: IORING_CQE_F_COPIED

Commit Message

Stefan Metzmacher Oct. 14, 2022, 11:06 a.m. UTC
Hi Pavel,

In the tests I made I used this version of IORING_CQE_F_COPIED:
https://git.samba.org/?p=metze/linux/wip.git;a=commitdiff;h=645d3b584c417a247d92d71baa6266a5f3d0d17d
(also inlined at the end)

Would that be something we want for 6.1? (Should I post it with a proper commit message, after doing some more tests?)

metze

  include/uapi/linux/io_uring.h | 1 +
  io_uring/notif.c              | 5 +++++
  net/ipv4/ip_output.c          | 3 ++-
  net/ipv4/tcp.c                | 2 ++
  net/ipv6/ip6_output.c         | 3 ++-
  5 files changed, 12 insertions(+), 2 deletions(-)

Comments

Pavel Begunkov Oct. 17, 2022, 4:46 p.m. UTC | #1
Hey Stefan,

On 10/14/22 12:06, Stefan Metzmacher wrote:
> Hi Pavel,
> 
> In the tests I made I used this version of IORING_CQE_F_COPIED:
> https://git.samba.org/?p=metze/linux/wip.git;a=commitdiff;h=645d3b584c417a247d92d71baa6266a5f3d0d17d
> (also inlined at the end)
> 
> Would that something we want for 6.1? (Should I post that with a useful commit message, after doing some more tests)

I was thinking, can it be delivered separately rather than in the same cqe?
The intention is to keep it off the IO path. For example, it can emit a
zc status CQE or maybe keep a "zc failed" counter inside the ring. Other
options? And we can add a separate callback for that, will make a couple
of things better.

What do you think? Especially from the userspace usability perspective.
Stefan Metzmacher Oct. 18, 2022, 8:43 a.m. UTC | #2
Hi Pavel,

> On 10/14/22 12:06, Stefan Metzmacher wrote:
>> Hi Pavel,
>>
>> In the tests I made I used this version of IORING_CQE_F_COPIED:
>> https://git.samba.org/?p=metze/linux/wip.git;a=commitdiff;h=645d3b584c417a247d92d71baa6266a5f3d0d17d
>> (also inlined at the end)
>>
>> Would that something we want for 6.1? (Should I post that with a useful commit message, after doing some more tests)
> 
> I was thinking, can it be delivered separately but not in the same cqe?
> The intention is to keep it off the IO path. For example, it can emit a
> zc status CQE or maybe keep a "zc failed" counter inside the ring. Other
> options? And we can add a separate callback for that, will make a couple
> of things better.
> 
> What do you think? Especially from the userspace usability perspective.

So far I can't think of any other way that would be useful,
but that doesn't mean a better option doesn't exist...

IORING_CQE_F_COPIED is available per request and makes it possible
to judge why the related SENDMSG_ZC was fast or not.
It's also available in trace-cmd report.

Everything else would likely re-introduce complexity similar to what
we had with the notif slots.

Instead of a new IORING_CQE_F_COPIED flag we could also set
cqe.res = SO_EE_CODE_ZEROCOPY_COPIED, but that isn't really different.

As I basically use the same logic that's used to generate SO_EE_CODE_ZEROCOPY_COPIED
for the native MSG_ZEROCOPY, I don't see the problem with IORING_CQE_F_COPIED.
Can you be more verbose why you're thinking about something different?

metze
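
[Editor's note: a minimal sketch of how userspace could consume the proposed flag. The struct and helper names are made up for illustration; the flag values match the patch at the end of this series, but IORING_CQE_F_COPIED is a proposal here, not an upstream API.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* CQE flag bits, values as in the proposed patch below */
#define IORING_CQE_F_MORE   (1U << 1)
#define IORING_CQE_F_NOTIF  (1U << 3)
#define IORING_CQE_F_COPIED (1U << 4)  /* proposed in this series */

/* simplified stand-in for struct io_uring_cqe */
struct cqe {
	uint64_t user_data;
	int32_t  res;
	uint32_t flags;
};

/* Returns true if this notification CQE reports that the kernel fell
 * back to copying the payload instead of doing true zero-copy. Only
 * the notification CQE (F_NOTIF) of a SEND[MSG]_ZC carries the flag. */
bool send_zc_was_copied(const struct cqe *cqe)
{
	if (!(cqe->flags & IORING_CQE_F_NOTIF))
		return false;
	return cqe->flags & IORING_CQE_F_COPIED;
}
```

An application would check this per notification CQE to learn, request by request, whether zero-copy actually happened, which is exactly the "judge why the related SENDMSG_ZC was fast or not" use case above.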
Pavel Begunkov Oct. 19, 2022, 3:06 p.m. UTC | #3
On 10/18/22 09:43, Stefan Metzmacher wrote:
> Hi Pavel,
> 
>> On 10/14/22 12:06, Stefan Metzmacher wrote:
>>> Hi Pavel,
>>>
>>> In the tests I made I used this version of IORING_CQE_F_COPIED:
>>> https://git.samba.org/?p=metze/linux/wip.git;a=commitdiff;h=645d3b584c417a247d92d71baa6266a5f3d0d17d
>>> (also inlined at the end)
>>>
>>> Would that something we want for 6.1? (Should I post that with a useful commit message, after doing some more tests)
>>
>> I was thinking, can it be delivered separately but not in the same cqe?
>> The intention is to keep it off the IO path. For example, it can emit a
>> zc status CQE or maybe keep a "zc failed" counter inside the ring. Other
>> options? And we can add a separate callback for that, will make a couple
>> of things better.
>>
>> What do you think? Especially from the userspace usability perspective.
> 
> So far I can't think of any other way that would be useful yet,
> but that doesn't mean something else might exist...
> 
> IORING_CQE_F_COPIED is available per request and makes it possible
> to judge why the related SENDMSG_ZC was fast or not.
> It's also available in trace-cmd report.
> 
> Everything else would likely re-introduce similar complexity like we
> had with the notif slots.
> 
> Instead of a new IORING_CQE_F_COPIED flag we could also set
> cqe.res = SO_EE_CODE_ZEROCOPY_COPIED, but that isn't really different.
> 
> As I basically use the same logic that's used to generate SO_EE_CODE_ZEROCOPY_COPIED
> for the native MSG_ZEROCOPY, I don't see the problem with IORING_CQE_F_COPIED.
> Can you be more verbose why you're thinking about something different?

Because it feels like something that should be done roughly once and in
advance. Performance wise, I agree that a bunch of extra instructions in
the (io_uring) IO path won't make difference as the net overhead is
already high, but I still prefer to keep it thin. The complexity is a
good point though, if only we could piggyback it onto MSG_PROBE.
Ok, let's do IORING_CQE_F_COPIED and aim for 6.2 + possibly a backport.

First, there is no more ubuf_info::zerocopy, see for-next, but you can
grab space in io_kiocb, io_kiocb::iopoll_completed is a good candidate.
You would want to take one io_uring patch I'm going to send (will CC
you), with that you won't need to change anything in net/. And the last
bit, let's make the zc probing conditional under IORING_RECVSEND_* flag,
I'll make it zero overhead when not set later by replacing the callback.
Stefan Metzmacher Oct. 19, 2022, 4:12 p.m. UTC | #4
Hi Pavel,

>> As I basically use the same logic that's used to generate SO_EE_CODE_ZEROCOPY_COPIED
>> for the native MSG_ZEROCOPY, I don't see the problem with IORING_CQE_F_COPIED.
>> Can you be more verbose why you're thinking about something different?
> 
> Because it feels like something that should be done roughly once and in
> advance. Performance wise, I agree that a bunch of extra instructions in
> the (io_uring) IO path won't make difference as the net overhead is
> already high, but I still prefer to keep it thin. The complexity is a
> good point though, if only we could piggy back it onto MSG_PROBE.
> Ok, let's do IORING_CQE_F_COPIED and aim 6.2 + possibly backport.

Thanks!

Experimenting with this stuff makes me wish for a way to have a
different 'user_data' field for the notif cqe,
maybe based on an IORING_RECVSEND_ flag. It may make my life
easier and would avoid some complexity in userspace,
as I need to handle retries on short writes even with MSG_WAITALL,
since EINTR and other errors can cause them.

What do you think?

> First, there is no more ubuf_info::zerocopy, see for-next, but you can
> grab space in io_kiocb, io_kiocb::iopoll_completed is a good candidate.

Ok I found your "net: introduce struct ubuf_info_msgzc" and
"net: shrink struct ubuf_info" commits.

I think the change would be trivial, the zerocopy field would just move
to struct io_notif_data..., maybe as 'bool copied'.

> You would want to take one io_uring patch I'm going to send (will CC
> you), with that you won't need to change anything in net/.

The problem is that e.g. tcp_sendmsg_locked() won't ever call
the callback at all if 'zc' is false.

That's why there's the:

                         if (!zc)
                                 uarg->zerocopy = 0;

Maybe I can invert the logic and use two variables, 'zero_copied'
and 'copied'.

We'd start with both being false and this logic in the callback:

if (success) {
     if (unlikely(!nd->zero_copied && !nd->copied))
        nd->zero_copied = true;
} else {
     if (unlikely(!nd->copied)) {
        nd->copied = true;
        nd->zero_copied = false;
     }
}

And __io_notif_complete_tw still needs:

         if (!nd->zero_copied)
                 notif->cqe.flags |= IORING_CQE_F_COPIED;

instead of if (nd->copied)
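[Editor's note: the two-variable scheme above can be modeled in plain C to check the intended outcomes. This is a sketch following the names in the discussion, not actual kernel code; in particular it models the case where tcp_sendmsg_locked() never invokes the callback at all.]

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative model of the zero_copied/copied scheme: the callback
 * may run once per skb, and a single copied (failed) completion must
 * win over any number of zero-copy ones. */
struct notif_data {
	bool zero_copied;
	bool copied;
};

void zc_callback(struct notif_data *nd, bool success)
{
	if (success) {
		if (!nd->zero_copied && !nd->copied)
			nd->zero_copied = true;
	} else if (!nd->copied) {
		nd->copied = true;
		nd->zero_copied = false;
	}
}

/* __io_notif_complete_tw equivalent: should IORING_CQE_F_COPIED be set?
 * Both flags starting false means "callback never ran" (the !zc path),
 * which must also report COPIED. */
bool report_copied(const struct notif_data *nd)
{
	return !nd->zero_copied;
}
```

The key property is that `report_copied()` is true both when the callback never ran and when at least one completion was copied, which is why the check is `!zero_copied` rather than `copied`.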

> And the last bit, let's make the zc probing conditional under IORING_RECVSEND_* flag,
> I'll make it zero overhead when not set later by replacing the callback.

And the if statement to select a high-speed callback based on
an IORING_RECVSEND_ flag is less overhead than
the if statements in the slow callback version?

metze
Pavel Begunkov Oct. 20, 2022, 2:24 a.m. UTC | #5
On 10/19/22 17:12, Stefan Metzmacher wrote:
> Hi Pavel,
> 
>>> As I basically use the same logic that's used to generate SO_EE_CODE_ZEROCOPY_COPIED
>>> for the native MSG_ZEROCOPY, I don't see the problem with IORING_CQE_F_COPIED.
>>> Can you be more verbose why you're thinking about something different?
>>
>> Because it feels like something that should be done roughly once and in
>> advance. Performance wise, I agree that a bunch of extra instructions in
>> the (io_uring) IO path won't make difference as the net overhead is
>> already high, but I still prefer to keep it thin. The complexity is a
>> good point though, if only we could piggy back it onto MSG_PROBE.
>> Ok, let's do IORING_CQE_F_COPIED and aim 6.2 + possibly backport.
> 
> Thanks!
> 
> Experimenting with this stuff lets me wish to have a way to
> have a different 'user_data' field for the notif cqe,
> maybe based on a IORING_RECVSEND_ flag, it may make my life
> easier and would avoid some complexity in userspace...
> As I need to handle retry on short writes even with MSG_WAITALL
> as EINTR and other errors could cause them.
> 
> What do you think?
> 
>> First, there is no more ubuf_info::zerocopy, see for-next, but you can
>> grab space in io_kiocb, io_kiocb::iopoll_completed is a good candidate.
> 
> Ok I found your "net: introduce struct ubuf_info_msgzc" and
> "net: shrink struct ubuf_info" commits.
> 
> I think the change would be trivial, the zerocopy field would just move
> to struct io_notif_data..., maybe as 'bool copied'.
> 
>> You would want to take one io_uring patch I'm going to send (will CC
>> you), with that you won't need to change anything in net/.
> 
> The problem is that e.g. tcp_sendmsg_locked() won't ever call
> the callback at all if 'zc' is false.
> 
> That's why there's the:
> 
>                          if (!zc)
>                                  uarg->zerocopy = 0;
> 
> Maybe I can inverse the logic and use two variables 'zero_copied'
> and 'copied'.
> 
> We'd start with both being false and this logic in the callback:
> 
> if (success) {
>      if (unlikely(!nd->zero_copied && !nd->copied))
>         nd->zero_copied = true;
> } else {
>      if (unlikely(!nd->copied)) {
>         nd->copied = true;
>         nd->zero_copied = false;
>      }
> }

Yep, something like that should do, but let's guard against a
spurious net_zcopy_put() just in case.

used = false;
copied = false;

callback(skb, success, ubuf) {
	if (skb)
		used = true;
	if (!success)
		copied = true;
}
complete() {
	if (!used || copied)
		set_flag(IORING_CQE_F_COPIED);
}

> And __io_notif_complete_tw still needs:
> 
>          if (!nd->zero_copied)
>                  notif->cqe.flags |= IORING_CQE_F_COPIED;

Which can be shoved into a custom callback.


>> And the last bit, let's make the zc probing conditional under IORING_RECVSEND_* flag,
>> I'll make it zero overhead when not set later by replacing the callback.
> 
> And the if statement to select a highspeed callback based on
> a IORING_RECVSEND_ flag is less overhead than
> the if statements in the slow callback version?

I'm more concerned about future changes around it, but there won't
be extra ifs.

#define COMMON_FLAGS (RECVSEND_FIRST_POLL|...)
#define ALL_FLAGS (COMMON_FLAGS|RECVSEND_PROBE)

if (flags & ~COMMON_FLAGS) {
	if (flags & ~ALL_FLAGS)
		return err;
	if (flags & RECVSEND_PROBE)
		set_callback(notif);
}
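[Editor's note: Pavel's pseudocode above fleshed out into a compilable sketch. The flag names beyond POLL_FIRST/FIXED_BUF are hypothetical, mirroring the pseudocode; none of this is upstream API.]

```c
#include <assert.h>

#define IORING_RECVSEND_POLL_FIRST (1U << 0)
#define IORING_RECVSEND_FIXED_BUF  (1U << 2)
#define IORING_RECVSEND_PROBE      (1U << 3) /* hypothetical */

#define COMMON_FLAGS (IORING_RECVSEND_POLL_FIRST | IORING_RECVSEND_FIXED_BUF)
#define ALL_FLAGS    (COMMON_FLAGS | IORING_RECVSEND_PROBE)

/* Returns -1 for unknown flags, 1 if the probing callback should be
 * installed, 0 for the plain fast path. The common case (no uncommon
 * flag set) takes a single branch, so probing costs nothing when off. */
int classify_flags(unsigned int flags)
{
	if (flags & ~COMMON_FLAGS) {
		if (flags & ~ALL_FLAGS)
			return -1;
		if (flags & IORING_RECVSEND_PROBE)
			return 1;
	}
	return 0;
}
```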
Stefan Metzmacher Oct. 20, 2022, 10:10 a.m. UTC | #6
Hi Pavel,

>> Experimenting with this stuff lets me wish to have a way to
>> have a different 'user_data' field for the notif cqe,
>> maybe based on a IORING_RECVSEND_ flag, it may make my life
>> easier and would avoid some complexity in userspace...
>> As I need to handle retry on short writes even with MSG_WAITALL
>> as EINTR and other errors could cause them.
>>
>> What do you think?

Any comment on this?

IORING_SEND_NOTIF_USER_DATA could let us use
notif->cqe.user_data = sqe->addr3;

metze
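
[Editor's note: a sketch of the prep-side semantics being proposed. IORING_SEND_NOTIF_USER_DATA is the flag proposed in this thread, not an upstream API, and the helper is illustrative; the real logic appears in the inline commit later in the thread.]

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define IORING_SEND_NOTIF_USER_DATA (1U << 3) /* proposed, illustrative value */

/* With the flag set, the notification CQE takes its user_data from
 * sqe->addr3; without it, addr3 must stay zero (io_uring rejects
 * non-zero unused fields) and the notif inherits the request's
 * user_data. */
int pick_notif_user_data(uint64_t req_user_data, uint64_t addr3,
			 unsigned int flags, uint64_t *out)
{
	if (flags & IORING_SEND_NOTIF_USER_DATA) {
		*out = addr3;
		return 0;
	}
	if (addr3)
		return -EINVAL;
	*out = req_user_data;
	return 0;
}
```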
Pavel Begunkov Oct. 20, 2022, 3:37 p.m. UTC | #7
On 10/20/22 11:10, Stefan Metzmacher wrote:
> Hi Pavel,
> 
>>> Experimenting with this stuff lets me wish to have a way to
>>> have a different 'user_data' field for the notif cqe,
>>> maybe based on a IORING_RECVSEND_ flag, it may make my life
>>> easier and would avoid some complexity in userspace...
>>> As I need to handle retry on short writes even with MSG_WAITALL
>>> as EINTR and other errors could cause them.
>>>
>>> What do you think?
> 
> Any comment on this?
> 
> IORING_SEND_NOTIF_USER_DATA could let us use
> notif->cqe.user_data = sqe->addr3;

I'd rather not use the last available u64, tbh, that was the
reason for not adding a second user_data in the first place.

Maybe IORING_SETUP_SQE128?
Stefan Metzmacher Oct. 21, 2022, 8:32 a.m. UTC | #8
Hi Pavel,

>>>> Experimenting with this stuff lets me wish to have a way to
>>>> have a different 'user_data' field for the notif cqe,
>>>> maybe based on a IORING_RECVSEND_ flag, it may make my life
>>>> easier and would avoid some complexity in userspace...
>>>> As I need to handle retry on short writes even with MSG_WAITALL
>>>> as EINTR and other errors could cause them.
>>>>
>>>> What do you think?
>>
>> Any comment on this?
>>
>> IORING_SEND_NOTIF_USER_DATA could let us use
>> notif->cqe.user_data = sqe->addr3;
> 
> I'd rather not use the last available u64, tbh, that was the
> reason for not adding a second user_data in the first place.

As far as I can see io_send_zc_prep has this:

         if (unlikely(READ_ONCE(sqe->__pad2[0]) || READ_ONCE(sqe->addr3)))
                 return -EINVAL;

both are u64...

metze
Pavel Begunkov Oct. 21, 2022, 9:27 a.m. UTC | #9
On 10/21/22 09:32, Stefan Metzmacher wrote:
> Hi Pavel,
> 
>>>>> Experimenting with this stuff lets me wish to have a way to
>>>>> have a different 'user_data' field for the notif cqe,
>>>>> maybe based on a IORING_RECVSEND_ flag, it may make my life
>>>>> easier and would avoid some complexity in userspace...
>>>>> As I need to handle retry on short writes even with MSG_WAITALL
>>>>> as EINTR and other errors could cause them.
>>>>>
>>>>> What do you think?
>>>
>>> Any comment on this?
>>>
>>> IORING_SEND_NOTIF_USER_DATA could let us use
>>> notif->cqe.user_data = sqe->addr3;
>>
>> I'd rather not use the last available u64, tbh, that was the
>> reason for not adding a second user_data in the first place.
> 
> As far as I can see io_send_zc_prep has this:
> 
>          if (unlikely(READ_ONCE(sqe->__pad2[0]) || READ_ONCE(sqe->addr3)))
>                  return -EINVAL;
> 
> both are u64...

Hah, true, completely forgot about that one
Stefan Metzmacher Oct. 21, 2022, 9:45 a.m. UTC | #10
Am 21.10.22 um 11:27 schrieb Pavel Begunkov:
> On 10/21/22 09:32, Stefan Metzmacher wrote:
>> Hi Pavel,
>>
>>>>>> Experimenting with this stuff lets me wish to have a way to
>>>>>> have a different 'user_data' field for the notif cqe,
>>>>>> maybe based on a IORING_RECVSEND_ flag, it may make my life
>>>>>> easier and would avoid some complexity in userspace...
>>>>>> As I need to handle retry on short writes even with MSG_WAITALL
>>>>>> as EINTR and other errors could cause them.
>>>>>>
>>>>>> What do you think?
>>>>
>>>> Any comment on this?
>>>>
>>>> IORING_SEND_NOTIF_USER_DATA could let us use
>>>> notif->cqe.user_data = sqe->addr3;
>>>
>>> I'd rather not use the last available u64, tbh, that was the
>>> reason for not adding a second user_data in the first place.
>>
>> As far as I can see io_send_zc_prep has this:
>>
>>          if (unlikely(READ_ONCE(sqe->__pad2[0]) || READ_ONCE(sqe->addr3)))
>>                  return -EINVAL;
>>
>> both are u64...
> 
> Hah, true, completely forgot about that one

So would a commit like below be fine for you?

Do you have anything in mind for SEND[MSG]_ZC that could possibly use
another u64 in future?

metze

diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 738d6234d1d9..7a6272872334 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -300,6 +300,7 @@ enum io_uring_op {
  #define IORING_RECVSEND_POLL_FIRST	(1U << 0)
  #define IORING_RECV_MULTISHOT		(1U << 1)
  #define IORING_RECVSEND_FIXED_BUF	(1U << 2)
+#define IORING_SEND_NOTIF_USER_DATA	(1U << 3)

  /*
   * accept flags stored in sqe->ioprio
diff --git a/io_uring/net.c b/io_uring/net.c
index 735eec545115..e1bc06b58cd7 100644
--- a/io_uring/net.c
+++ b/io_uring/net.c
@@ -938,7 +938,7 @@ int io_send_zc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
  	struct io_ring_ctx *ctx = req->ctx;
  	struct io_kiocb *notif;

-	if (unlikely(READ_ONCE(sqe->__pad2[0]) || READ_ONCE(sqe->addr3)))
+	if (unlikely(READ_ONCE(sqe->__pad2[0])))
  		return -EINVAL;
  	/* we don't support IOSQE_CQE_SKIP_SUCCESS just yet */
  	if (req->flags & REQ_F_CQE_SKIP)
@@ -946,12 +946,19 @@ int io_send_zc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)

  	zc->flags = READ_ONCE(sqe->ioprio);
  	if (zc->flags & ~(IORING_RECVSEND_POLL_FIRST |
-			  IORING_RECVSEND_FIXED_BUF))
+			  IORING_RECVSEND_FIXED_BUF |
+			  IORING_SEND_NOTIF_USER_DATA))
  		return -EINVAL;
  	notif = zc->notif = io_alloc_notif(ctx);
  	if (!notif)
  		return -ENOMEM;
-	notif->cqe.user_data = req->cqe.user_data;
+	if (zc->flags & IORING_SEND_NOTIF_USER_DATA) {
+		notif->cqe.user_data = READ_ONCE(sqe->addr3);
+	} else {
+		if (unlikely(READ_ONCE(sqe->addr3)))
+			return -EINVAL;
+		notif->cqe.user_data = req->cqe.user_data;
+	}
  	notif->cqe.res = 0;
  	notif->cqe.flags = IORING_CQE_F_NOTIF;
  	req->flags |= REQ_F_NEED_CLEANUP;
Stefan Metzmacher Oct. 21, 2022, 10:15 a.m. UTC | #11
Hi Pavel, and others...

>> As far as I can see io_send_zc_prep has this:
>>
>>          if (unlikely(READ_ONCE(sqe->__pad2[0]) || READ_ONCE(sqe->addr3)))
>>                  return -EINVAL;
>>
>> both are u64...
> 
> Hah, true, completely forgot about that one

BTW: any comment on my "[RFC PATCH 0/8] cleanup struct io_uring_sqe layout"
thread, which would make it much easier to figure out which fields are used? See
https://lore.kernel.org/io-uring/cover.1660291547.git.metze@samba.org/#r

metze
Pavel Begunkov Oct. 21, 2022, 11:20 a.m. UTC | #12
On 10/21/22 10:45, Stefan Metzmacher wrote:
> Am 21.10.22 um 11:27 schrieb Pavel Begunkov:
>> On 10/21/22 09:32, Stefan Metzmacher wrote:
>>> Hi Pavel,
>>>
>>>>>>> Experimenting with this stuff lets me wish to have a way to
>>>>>>> have a different 'user_data' field for the notif cqe,
>>>>>>> maybe based on a IORING_RECVSEND_ flag, it may make my life
>>>>>>> easier and would avoid some complexity in userspace...
>>>>>>> As I need to handle retry on short writes even with MSG_WAITALL
>>>>>>> as EINTR and other errors could cause them.
>>>>>>>
>>>>>>> What do you think?
>>>>>
>>>>> Any comment on this?
>>>>>
>>>>> IORING_SEND_NOTIF_USER_DATA could let us use
>>>>> notif->cqe.user_data = sqe->addr3;
>>>>
>>>> I'd rather not use the last available u64, tbh, that was the
>>>> reason for not adding a second user_data in the first place.
>>>
>>> As far as I can see io_send_zc_prep has this:
>>>
>>>          if (unlikely(READ_ONCE(sqe->__pad2[0]) || READ_ONCE(sqe->addr3)))
>>>                  return -EINVAL;
>>>
>>> both are u64...
>>
>> Hah, true, completely forgot about that one
> 
> So would a commit like below be fine for you?
> 
> Do you have anything in mind for SEND[MSG]_ZC that could possibly use
> another u64 in future?

It'll most likely be taken in the future, some features are planned
some I can imagine. The question is how necessary this one is and/or
how much simpler it would make it considering that CQEs are ordered
and apps still need to check for F_MORE. It shouldn't even require
refcounting. Can you elaborate on the simplifying userspace part?
Pavel Begunkov Oct. 21, 2022, 11:26 a.m. UTC | #13
On 10/21/22 11:15, Stefan Metzmacher wrote:
> Hi Pavel, and others...
> 
>>> As far as I can see io_send_zc_prep has this:
>>>
>>>          if (unlikely(READ_ONCE(sqe->__pad2[0]) || READ_ONCE(sqe->addr3)))
>>>                  return -EINVAL;
>>>
>>> both are u64...
>>
>> Hah, true, completely forgot about that one
> 
> BTW: any comment on my "[RFC PATCH 0/8] cleanup struct io_uring_sqe layout"
> thread, that would make it much easier to figure out which fields are used..., see
> https://lore.kernel.org/io-uring/cover.1660291547.git.metze@samba.org/#r

I admit the sqe layout is messy as there is no good separation b/w
common vs opcode specific fields, but it's not like the new layout
makes it much simpler. E.g. looking up who is using a field will get
more complicated. iow, no strong opinion on it.

btw, will be happy to have the include guard patch from one of
your branches
Stefan Metzmacher Oct. 21, 2022, 12:10 p.m. UTC | #14
Am 21.10.22 um 13:20 schrieb Pavel Begunkov:
> On 10/21/22 10:45, Stefan Metzmacher wrote:
>> Am 21.10.22 um 11:27 schrieb Pavel Begunkov:
>>> On 10/21/22 09:32, Stefan Metzmacher wrote:
>>>> Hi Pavel,
>>>>
>>>>>>>> Experimenting with this stuff lets me wish to have a way to
>>>>>>>> have a different 'user_data' field for the notif cqe,
>>>>>>>> maybe based on a IORING_RECVSEND_ flag, it may make my life
>>>>>>>> easier and would avoid some complexity in userspace...
>>>>>>>> As I need to handle retry on short writes even with MSG_WAITALL
>>>>>>>> as EINTR and other errors could cause them.
>>>>>>>>
>>>>>>>> What do you think?
>>>>>>
>>>>>> Any comment on this?
>>>>>>
>>>>>> IORING_SEND_NOTIF_USER_DATA could let us use
>>>>>> notif->cqe.user_data = sqe->addr3;
>>>>>
>>>>> I'd rather not use the last available u64, tbh, that was the
>>>>> reason for not adding a second user_data in the first place.
>>>>
>>>> As far as I can see io_send_zc_prep has this:
>>>>
>>>>          if (unlikely(READ_ONCE(sqe->__pad2[0]) || READ_ONCE(sqe->addr3)))
>>>>                  return -EINVAL;
>>>>
>>>> both are u64...
>>>
>>> Hah, true, completely forgot about that one
>>
>> So would a commit like below be fine for you?
>>
>> Do you have anything in mind for SEND[MSG]_ZC that could possibly use
>> another u64 in future?
> 
> It'll most likely be taken in the future, some features are planned
> some I can imagine.

Can you give examples? I can't imagine any possible feature.

> The question is how necessary this one is and/or
> how much simpler it would make it considering that CQEs are ordered
> and apps still need to check for F_MORE. It shouldn't even require
> refcounting. Can you elaborate on the simplifying userspace part?
> 
It's not critical, it would just make it easier to dispatch
different callback functions for the two cases.

The current problem I'm facing is that I have a structure
holding the state of an response and that has a single embedded
completion structure:

(simplified) struct completion {
    uint32_t generation;
    void (*callback_fn)(void *callback_private, const struct io_uring_cqe *cqe);
    void *callback_private;
};

I use the memory address of the completion structure, glued together with the lower
bits of the generation, as 'user_data'. Imagine I got a short write from SENDMSG_ZC/WAITALL
because EINTR was generated; then I need to retry from userspace, which
I'd do immediately without waiting for the NOTIF cqe to arrive.

For each incoming cqe I get the completion address and the generation
out of user_data and then verify the generation against the one stored in
the completion in order to detect bugs, before passing over to callback_fn().

Because I still need to handle the NOTIF cqe from the first try
I can't change the generation for the next try.

I thought about using two completion structures: one for the main SENDMSG_ZC result
(which gets its generation incremented with each retry) and one for the NOTIF cqes,
keeping its generation stable and having a simple callback_fn that just waits for a
refcount to reach 0.

Most likely I just need to sit down and concentrate to get the
refcounting and similar things sorted out.

If there are really useful things we can do with addr3 and __pad2[0],
I can try to cope with it... It would just be sad if they ended up unused anyway.

metze
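
[Editor's note: the pointer-plus-generation user_data scheme described above, as a standalone sketch. The 4-bit split and the 16-byte alignment assumption are illustrative, not Stefan's actual code.]

```c
#include <assert.h>
#include <stdint.h>

/* The completion struct is assumed to be at least 16-byte aligned,
 * leaving the bottom 4 address bits free to carry the low bits of
 * the generation counter. */
#define GEN_BITS 4ULL
#define GEN_MASK ((1ULL << GEN_BITS) - 1)

uint64_t encode_user_data(const void *completion, uint32_t generation)
{
	return (uint64_t)(uintptr_t)completion | ((uint64_t)generation & GEN_MASK);
}

void *decode_completion(uint64_t user_data)
{
	return (void *)(uintptr_t)(user_data & ~GEN_MASK);
}

uint32_t decode_generation(uint64_t user_data)
{
	return (uint32_t)(user_data & GEN_MASK);
}
```

On each incoming cqe, the decoded generation is compared against the one stored in the completion; a mismatch indicates a stale cqe (e.g. a NOTIF from a previous try) or a bug, which is exactly the check the retry problem above runs into.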
Stefan Metzmacher Oct. 21, 2022, 12:38 p.m. UTC | #15
Am 21.10.22 um 13:26 schrieb Pavel Begunkov:
> On 10/21/22 11:15, Stefan Metzmacher wrote:
>> Hi Pavel, and others...
>>
>>>> As far as I can see io_send_zc_prep has this:
>>>>
>>>>          if (unlikely(READ_ONCE(sqe->__pad2[0]) || READ_ONCE(sqe->addr3)))
>>>>                  return -EINVAL;
>>>>
>>>> both are u64...
>>>
>>> Hah, true, completely forgot about that one
>>
>> BTW: any comment on my "[RFC PATCH 0/8] cleanup struct io_uring_sqe layout"
>> thread, that would make it much easier to figure out which fields are used..., see
>> https://lore.kernel.org/io-uring/cover.1660291547.git.metze@samba.org/#r
> 
> I admit the sqe layout is messy as there is no good separation b/w
> common vs opcode specific fields, but it's not like the new layout
> makes it much simpler.

Really?

> E.g. looking who is using a field will get more complicated.

Why should anyone care what fields other opcodes use
and how they are named?

For legacy reasons we have to live with
struct io_uring_sqe_common common; in the middle.
Apart from that, each opcode should be free to use
5 u64 fields and 1 u32 field.

But e.g.

+               /* IORING_OP_FALLOCATE */
+               struct io_uring_sqe_fallocate {
+                       struct io_uring_sqe_hdr hdr;
+
+                       __u64   offset;
+                       __u64   length;
+                       __u32   mode;
+                       __u32   u32_ofs28;
+
+                       struct io_uring_sqe_common common;
+
+                       __u32   u32_ofs44;
+                       __u64   u64_ofs48;
+                       __u64   u64_ofs56;
+               } fallocate;

That immediately shows what's used and what's not,
and it avoids brain-dead things like using
sqe->addr instead of sqe->len for the length.

And it makes it trivial to verify that the _prep function
rejects any unused field.

And it would make it easier to write per-opcode tracing code,
which can be easily analyzed.

> iow, no strong opinion on it.
> 
> btw, will be happy to have the include guard patch from one of
> your branches

This one from the io_uring_livepatch.v6.1 branch?
https://git.samba.org/?p=metze/linux/wip.git;a=commitdiff;h=3c36e05baad737f5cb896fdc9fc53dc1b74d2499

metze

Patch

diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 04729989e6ee..efeab6a9b4f3 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -341,6 +341,7 @@  struct io_uring_cqe {
  #define IORING_CQE_F_MORE		(1U << 1)
  #define IORING_CQE_F_SOCK_NONEMPTY	(1U << 2)
  #define IORING_CQE_F_NOTIF		(1U << 3)
+#define IORING_CQE_F_COPIED		(1U << 4)

  enum {
  	IORING_CQE_BUFFER_SHIFT		= 16,
diff --git a/io_uring/notif.c b/io_uring/notif.c
index e37c6569d82e..2162d1af0b60 100644
--- a/io_uring/notif.c
+++ b/io_uring/notif.c
@@ -18,6 +18,8 @@  static void __io_notif_complete_tw(struct io_kiocb *notif, bool *locked)
  		__io_unaccount_mem(ctx->user, nd->account_pages);
  		nd->account_pages = 0;
  	}
+	if (!nd->uarg.zerocopy)
+		notif->cqe.flags |= IORING_CQE_F_COPIED;
  	io_req_task_complete(notif, locked);
  }

@@ -28,6 +30,8 @@  static void io_uring_tx_zerocopy_callback(struct sk_buff *skb,
  	struct io_notif_data *nd = container_of(uarg, struct io_notif_data, uarg);
  	struct io_kiocb *notif = cmd_to_io_kiocb(nd);

+	uarg->zerocopy = uarg->zerocopy & success;
+
  	if (refcount_dec_and_test(&uarg->refcnt)) {
  		notif->io_task_work.func = __io_notif_complete_tw;
  		io_req_task_work_add(notif);
@@ -53,6 +57,7 @@  struct io_kiocb *io_alloc_notif(struct io_ring_ctx *ctx)

  	nd = io_notif_to_data(notif);
  	nd->account_pages = 0;
+	nd->uarg.zerocopy = 1;
  	nd->uarg.flags = SKBFL_ZEROCOPY_FRAG | SKBFL_DONT_ORPHAN;
  	nd->uarg.callback = io_uring_tx_zerocopy_callback;
  	refcount_set(&nd->uarg.refcnt, 1);
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 04e2034f2f8e..64d263a8ece8 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1032,7 +1032,8 @@  static int __ip_append_data(struct sock *sk,
  				paged = true;
  				zc = true;
  				uarg = msg->msg_ubuf;
-			}
+			} else
+				msg->msg_ubuf->zerocopy = 0;
  		} else if (sock_flag(sk, SOCK_ZEROCOPY)) {
  			uarg = msg_zerocopy_realloc(sk, length, skb_zcopy(skb));
  			if (!uarg)
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index cdf26724d7db..d3a2ed9f22df 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1247,6 +1247,8 @@  int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
  			uarg = msg->msg_ubuf;
  			net_zcopy_get(uarg);
  			zc = sk->sk_route_caps & NETIF_F_SG;
+			if (!zc)
+				uarg->zerocopy = 0;
  		} else if (sock_flag(sk, SOCK_ZEROCOPY)) {
  			uarg = msg_zerocopy_realloc(sk, size, skb_zcopy(skb));
  			if (!uarg) {
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index bb0f469a5247..3d75dd05ff98 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1559,7 +1559,8 @@  static int __ip6_append_data(struct sock *sk,
  				paged = true;
  				zc = true;
  				uarg = msg->msg_ubuf;
-			}
+			} else
+				msg->msg_ubuf->zerocopy = 0;
  		} else if (sock_flag(sk, SOCK_ZEROCOPY)) {
  			uarg = msg_zerocopy_realloc(sk, length, skb_zcopy(skb));
  			if (!uarg)