diff mbox series

IORING_CQE_F_COPIED

Message ID 4385ba84-55dd-6b08-0ca7-6b4a43f9d9a2@samba.org (mailing list archive)
State RFC
Delegated to: Netdev Maintainers
Headers show
Series IORING_CQE_F_COPIED | expand

Checks

Context Check Description
netdev/tree_selection success Not a local patch

Commit Message

Stefan Metzmacher Oct. 14, 2022, 11:06 a.m. UTC
Hi Pavel,

In the tests I made I used this version of IORING_CQE_F_COPIED:
https://git.samba.org/?p=metze/linux/wip.git;a=commitdiff;h=645d3b584c417a247d92d71baa6266a5f3d0d17d
(also inlined at the end)

Would that something we want for 6.1? (Should I post that with a useful commit message, after doing some more tests)

metze

  include/uapi/linux/io_uring.h | 1 +
  io_uring/notif.c              | 5 +++++
  net/ipv4/ip_output.c          | 3 ++-
  net/ipv4/tcp.c                | 2 ++
  net/ipv6/ip6_output.c         | 3 ++-
  5 files changed, 12 insertions(+), 2 deletions(-)

Comments

Pavel Begunkov Oct. 17, 2022, 4:46 p.m. UTC | #1
Hey Stefan,

On 10/14/22 12:06, Stefan Metzmacher wrote:
> Hi Pavel,
> 
> In the tests I made I used this version of IORING_CQE_F_COPIED:
> https://git.samba.org/?p=metze/linux/wip.git;a=commitdiff;h=645d3b584c417a247d92d71baa6266a5f3d0d17d
> (also inlined at the end)
> 
> Would that something we want for 6.1? (Should I post that with a useful commit message, after doing some more tests)

I was thinking, can it be delivered separately but not in the same cqe?
The intention is to keep it off the IO path. For example, it can emit a
zc status CQE or maybe keep a "zc failed" counter inside the ring. Other
options? And we can add a separate callback for that, will make a couple
of things better.

What do you think? Especially from the userspace usability perspective.
Stefan Metzmacher Oct. 18, 2022, 8:43 a.m. UTC | #2
Hi Pavel,

> On 10/14/22 12:06, Stefan Metzmacher wrote:
>> Hi Pavel,
>>
>> In the tests I made I used this version of IORING_CQE_F_COPIED:
>> https://git.samba.org/?p=metze/linux/wip.git;a=commitdiff;h=645d3b584c417a247d92d71baa6266a5f3d0d17d
>> (also inlined at the end)
>>
>> Would that something we want for 6.1? (Should I post that with a useful commit message, after doing some more tests)
> 
> I was thinking, can it be delivered separately but not in the same cqe?
> The intention is to keep it off the IO path. For example, it can emit a
> zc status CQE or maybe keep a "zc failed" counter inside the ring. Other
> options? And we can add a separate callback for that, will make a couple
> of things better.
> 
> What do you think? Especially from the userspace usability perspective.

So far I can't think of any other way that would be useful yet,
but that doesn't mean something else might exist...

IORING_CQE_F_COPIED is available per request and makes it possible
to judge why the related SENDMSG_ZC was fast or not.
It's also available in trace-cmd report.

Everything else would likely re-introduce similar complexity like we
had with the notif slots.

Instead of a new IORING_CQE_F_COPIED flag we could also set
cqe.res = SO_EE_CODE_ZEROCOPY_COPIED, but that isn't really different.

As I basically use the same logic that's used to generate SO_EE_CODE_ZEROCOPY_COPIED
for the native MSG_ZEROCOPY, I don't see the problem with IORING_CQE_F_COPIED.
Can you be more verbose why you're thinking about something different?

metze
Pavel Begunkov Oct. 19, 2022, 3:06 p.m. UTC | #3
On 10/18/22 09:43, Stefan Metzmacher wrote:
> Hi Pavel,
> 
>> On 10/14/22 12:06, Stefan Metzmacher wrote:
>>> Hi Pavel,
>>>
>>> In the tests I made I used this version of IORING_CQE_F_COPIED:
>>> https://git.samba.org/?p=metze/linux/wip.git;a=commitdiff;h=645d3b584c417a247d92d71baa6266a5f3d0d17d
>>> (also inlined at the end)
>>>
>>> Would that something we want for 6.1? (Should I post that with a useful commit message, after doing some more tests)
>>
>> I was thinking, can it be delivered separately but not in the same cqe?
>> The intention is to keep it off the IO path. For example, it can emit a
>> zc status CQE or maybe keep a "zc failed" counter inside the ring. Other
>> options? And we can add a separate callback for that, will make a couple
>> of things better.
>>
>> What do you think? Especially from the userspace usability perspective.
> 
> So far I can't think of any other way that would be useful yet,
> but that doesn't mean something else might exist...
> 
> IORING_CQE_F_COPIED is available per request and makes it possible
> to judge why the related SENDMSG_ZC was fast or not.
> It's also available in trace-cmd report.
> 
> Everything else would likely re-introduce similar complexity like we
> had with the notif slots.
> 
> Instead of a new IORING_CQE_F_COPIED flag we could also set
> cqe.res = SO_EE_CODE_ZEROCOPY_COPIED, but that isn't really different.
> 
> As I basically use the same logic that's used to generate SO_EE_CODE_ZEROCOPY_COPIED
> for the native MSG_ZEROCOPY, I don't see the problem with IORING_CQE_F_COPIED.
> Can you be more verbose why you're thinking about something different?

Because it feels like something that should be done roughly once and in
advance. Performance wise, I agree that a bunch of extra instructions in
the (io_uring) IO path won't make difference as the net overhead is
already high, but I still prefer to keep it thin. The complexity is a
good point though, if only we could piggy back it onto MSG_PROBE.
Ok, let's do IORING_CQE_F_COPIED and aim 6.2 + possibly backport.

First, there is no more ubuf_info::zerocopy, see for-next, but you can
grab space in io_kiocb, io_kiocb::iopoll_completed is a good candidate.
You would want to take one io_uring patch I'm going to send (will CC
you), with that you won't need to change anything in net/. And the last
bit, let's make the zc probing conditional under IORING_RECVSEND_* flag,
I'll make it zero overhead when not set later by replacing the callback.
Stefan Metzmacher Oct. 19, 2022, 4:12 p.m. UTC | #4
Hi Pavel,

>> As I basically use the same logic that's used to generate SO_EE_CODE_ZEROCOPY_COPIED
>> for the native MSG_ZEROCOPY, I don't see the problem with IORING_CQE_F_COPIED.
>> Can you be more verbose why you're thinking about something different?
> 
> Because it feels like something that should be done roughly once and in
> advance. Performance wise, I agree that a bunch of extra instructions in
> the (io_uring) IO path won't make difference as the net overhead is
> already high, but I still prefer to keep it thin. The complexity is a
> good point though, if only we could piggy back it onto MSG_PROBE.
> Ok, let's do IORING_CQE_F_COPIED and aim 6.2 + possibly backport.

Thanks!

Experimenting with this stuff lets me wish to have a way to
have a different 'user_data' field for the notif cqe,
maybe based on a IORING_RECVSEND_ flag, it may make my life
easier and would avoid some complexity in userspace...
As I need to handle retry on short writes even with MSG_WAITALL
as EINTR and other errors could cause them.

What do you think?

> First, there is no more ubuf_info::zerocopy, see for-next, but you can
> grab space in io_kiocb, io_kiocb::iopoll_completed is a good candidate.

Ok I found your "net: introduce struct ubuf_info_msgzc" and
"net: shrink struct ubuf_info" commits.

I think the change would be trivial, the zerocopy field would just move
to struct io_notif_data..., maybe as 'bool copied'.

> You would want to take one io_uring patch I'm going to send (will CC
> you), with that you won't need to change anything in net/.

The problem is that e.g. tcp_sendmsg_locked() won't ever call
the callback at all if 'zc' is false.

That's why there's the:

                         if (!zc)
                                 uarg->zerocopy = 0;

Maybe I can inverse the logic and use two variables 'zero_copied'
and 'copied'.

We'd start with both being false and this logic in the callback:

if (success) {
     if (unlikely(!nd->zero_copied && !nd->copied))
        nd->zero_copied = true;
} else {
     if (unlikely(!nd->copied)) {
        nd->copied = true;
        nd->zero_copied = false;
     }
}

And __io_notif_complete_tw still needs:

         if (!nd->zero_copied)
                 notif->cqe.flags |= IORING_CQE_F_COPIED;

instead of if (nd->copied)

> And the last bit, let's make the zc probing conditional under IORING_RECVSEND_* flag,
> I'll make it zero overhead when not set later by replacing the callback.

And the if statement to select a highspeed callback based on
a IORING_RECVSEND_ flag is less overhead than
the if statements in the slow callback version?

metze
Pavel Begunkov Oct. 20, 2022, 2:24 a.m. UTC | #5
On 10/19/22 17:12, Stefan Metzmacher wrote:
> Hi Pavel,
> 
>>> As I basically use the same logic that's used to generate SO_EE_CODE_ZEROCOPY_COPIED
>>> for the native MSG_ZEROCOPY, I don't see the problem with IORING_CQE_F_COPIED.
>>> Can you be more verbose why you're thinking about something different?
>>
>> Because it feels like something that should be done roughly once and in
>> advance. Performance wise, I agree that a bunch of extra instructions in
>> the (io_uring) IO path won't make difference as the net overhead is
>> already high, but I still prefer to keep it thin. The complexity is a
>> good point though, if only we could piggy back it onto MSG_PROBE.
>> Ok, let's do IORING_CQE_F_COPIED and aim 6.2 + possibly backport.
> 
> Thanks!
> 
> Experimenting with this stuff lets me wish to have a way to
> have a different 'user_data' field for the notif cqe,
> maybe based on a IORING_RECVSEND_ flag, it may make my life
> easier and would avoid some complexity in userspace...
> As I need to handle retry on short writes even with MSG_WAITALL
> as EINTR and other errors could cause them.
> 
> What do you think?
> 
>> First, there is no more ubuf_info::zerocopy, see for-next, but you can
>> grab space in io_kiocb, io_kiocb::iopoll_completed is a good candidate.
> 
> Ok I found your "net: introduce struct ubuf_info_msgzc" and
> "net: shrink struct ubuf_info" commits.
> 
> I think the change would be trivial, the zerocopy field would just move
> to struct io_notif_data..., maybe as 'bool copied'.
> 
>> You would want to take one io_uring patch I'm going to send (will CC
>> you), with that you won't need to change anything in net/.
> 
> The problem is that e.g. tcp_sendmsg_locked() won't ever call
> the callback at all if 'zc' is false.
> 
> That's why there's the:
> 
>                          if (!zc)
>                                  uarg->zerocopy = 0;
> 
> Maybe I can inverse the logic and use two variables 'zero_copied'
> and 'copied'.
> 
> We'd start with both being false and this logic in the callback:> 
> if (success) {
>      if (unlikely(!nd->zero_copied && !nd->copied))
>         nd->zero_copied = true;
> } else {
>      if (unlikely(!nd->copied)) {
>         nd->copied = true;
>         nd->zero_copied = false;
>      }
> }

Yep, sth like that should do, but let's guard against
spurious net_zcopy_put() just in case.

used = false;
copied = false;

callback(skb, success, ubuf) {
	if (skb)
		used = true;
	if (!success)
		copied = true;
}
complete() {
	if (!used || copied)
		set_flag(IORING_CQE_F_COPIED);
}

> And __io_notif_complete_tw still needs:
> 
>          if (!nd->zero_copied)
>                  notif->cqe.flags |= IORING_CQE_F_COPIED;

Which can be shoved in a custom callback


>> And the last bit, let's make the zc probing conditional under IORING_RECVSEND_* flag,
>> I'll make it zero overhead when not set later by replacing the callback.
> 
> And the if statement to select a highspeed callback based on
> a IORING_RECVSEND_ flag is less overhead than
> the if statements in the slow callback version?

I'm more concerned about future changes around it, but there won't
be extra ifs.

#define COMMON_FLAGS (RECVSEND_FIRST_POLL|...)
#define ALL_FLAGS (COMMON_FLAGS|RECVSEND_PROBE)

if (flags & ~COMMON_FLAGS) {
	if (flags & ~ALL_FLAGS)
		return err;
	if (flags & RECVSEND_PROBE)
		set_callback(notif);
}
Stefan Metzmacher Oct. 20, 2022, 10:10 a.m. UTC | #6
Hi Pavel,

>> Experimenting with this stuff lets me wish to have a way to
>> have a different 'user_data' field for the notif cqe,
>> maybe based on a IORING_RECVSEND_ flag, it may make my life
>> easier and would avoid some complexity in userspace...
>> As I need to handle retry on short writes even with MSG_WAITALL
>> as EINTR and other errors could cause them.
>>
>> What do you think?

Any comment on this?

IORING_SEND_NOTIF_USER_DATA could let us use
notif->cqe.user_data = sqe->addr3;

metze
Pavel Begunkov Oct. 20, 2022, 3:37 p.m. UTC | #7
On 10/20/22 11:10, Stefan Metzmacher wrote:
> Hi Pavel,
> 
>>> Experimenting with this stuff lets me wish to have a way to
>>> have a different 'user_data' field for the notif cqe,
>>> maybe based on a IORING_RECVSEND_ flag, it may make my life
>>> easier and would avoid some complexity in userspace...
>>> As I need to handle retry on short writes even with MSG_WAITALL
>>> as EINTR and other errors could cause them.
>>>
>>> What do you think?
> 
> Any comment on this?
> 
> IORING_SEND_NOTIF_USER_DATA could let us use
> notif->cqe.user_data = sqe->addr3;

I'd rather not use the last available u64, tbh, that was the
reason for not adding a second user_data in the first place.

Maybe IORING_SETUP_SQE128?
Stefan Metzmacher Oct. 21, 2022, 8:32 a.m. UTC | #8
Hi Pavel,

>>>> Experimenting with this stuff lets me wish to have a way to
>>>> have a different 'user_data' field for the notif cqe,
>>>> maybe based on a IORING_RECVSEND_ flag, it may make my life
>>>> easier and would avoid some complexity in userspace...
>>>> As I need to handle retry on short writes even with MSG_WAITALL
>>>> as EINTR and other errors could cause them.
>>>>
>>>> What do you think?
>>
>> Any comment on this?
>>
>> IORING_SEND_NOTIF_USER_DATA could let us use
>> notif->cqe.user_data = sqe->addr3;
> 
> I'd rather not use the last available u64, tbh, that was the
> reason for not adding a second user_data in the first place.

As far as I can see io_send_zc_prep has this:

         if (unlikely(READ_ONCE(sqe->__pad2[0]) || READ_ONCE(sqe->addr3)))
                 return -EINVAL;

both are u64...

metze
Pavel Begunkov Oct. 21, 2022, 9:27 a.m. UTC | #9
On 10/21/22 09:32, Stefan Metzmacher wrote:
> Hi Pavel,
> 
>>>>> Experimenting with this stuff lets me wish to have a way to
>>>>> have a different 'user_data' field for the notif cqe,
>>>>> maybe based on a IORING_RECVSEND_ flag, it may make my life
>>>>> easier and would avoid some complexity in userspace...
>>>>> As I need to handle retry on short writes even with MSG_WAITALL
>>>>> as EINTR and other errors could cause them.
>>>>>
>>>>> What do you think?
>>>
>>> Any comment on this?
>>>
>>> IORING_SEND_NOTIF_USER_DATA could let us use
>>> notif->cqe.user_data = sqe->addr3;
>>
>> I'd rather not use the last available u64, tbh, that was the
>> reason for not adding a second user_data in the first place.
> 
> As far as I can see io_send_zc_prep has this:
> 
>          if (unlikely(READ_ONCE(sqe->__pad2[0]) || READ_ONCE(sqe->addr3)))
>                  return -EINVAL;
> 
> both are u64...

Hah, true, completely forgot about that one
Stefan Metzmacher Oct. 21, 2022, 9:45 a.m. UTC | #10
Am 21.10.22 um 11:27 schrieb Pavel Begunkov:
> On 10/21/22 09:32, Stefan Metzmacher wrote:
>> Hi Pavel,
>>
>>>>>> Experimenting with this stuff lets me wish to have a way to
>>>>>> have a different 'user_data' field for the notif cqe,
>>>>>> maybe based on a IORING_RECVSEND_ flag, it may make my life
>>>>>> easier and would avoid some complexity in userspace...
>>>>>> As I need to handle retry on short writes even with MSG_WAITALL
>>>>>> as EINTR and other errors could cause them.
>>>>>>
>>>>>> What do you think?
>>>>
>>>> Any comment on this?
>>>>
>>>> IORING_SEND_NOTIF_USER_DATA could let us use
>>>> notif->cqe.user_data = sqe->addr3;
>>>
>>> I'd rather not use the last available u64, tbh, that was the
>>> reason for not adding a second user_data in the first place.
>>
>> As far as I can see io_send_zc_prep has this:
>>
>>          if (unlikely(READ_ONCE(sqe->__pad2[0]) || READ_ONCE(sqe->addr3)))
>>                  return -EINVAL;
>>
>> both are u64...
> 
> Hah, true, completely forgot about that one

So would a commit like below be fine for you?

Do you have anything in mind for SEND[MSG]_ZC that could possibly use
another u64 in future?

metze

diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 738d6234d1d9..7a6272872334 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -300,6 +300,7 @@ enum io_uring_op {
  #define IORING_RECVSEND_POLL_FIRST	(1U << 0)
  #define IORING_RECV_MULTISHOT		(1U << 1)
  #define IORING_RECVSEND_FIXED_BUF	(1U << 2)
+#define IORING_SEND_NOTIF_USER_DATA	(1U << 3)

  /*
   * accept flags stored in sqe->ioprio
diff --git a/io_uring/net.c b/io_uring/net.c
index 735eec545115..e1bc06b58cd7 100644
--- a/io_uring/net.c
+++ b/io_uring/net.c
@@ -938,7 +938,7 @@ int io_send_zc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
  	struct io_ring_ctx *ctx = req->ctx;
  	struct io_kiocb *notif;

-	if (unlikely(READ_ONCE(sqe->__pad2[0]) || READ_ONCE(sqe->addr3)))
+	if (unlikely(READ_ONCE(sqe->__pad2[0]))
  		return -EINVAL;
  	/* we don't support IOSQE_CQE_SKIP_SUCCESS just yet */
  	if (req->flags & REQ_F_CQE_SKIP)
@@ -946,12 +946,19 @@ int io_send_zc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)

  	zc->flags = READ_ONCE(sqe->ioprio);
  	if (zc->flags & ~(IORING_RECVSEND_POLL_FIRST |
-			  IORING_RECVSEND_FIXED_BUF))
+			  IORING_RECVSEND_FIXED_BUF |
+			  IORING_SEND_NOTIF_USER_DATA))
  		return -EINVAL;
  	notif = zc->notif = io_alloc_notif(ctx);
  	if (!notif)
  		return -ENOMEM;
-	notif->cqe.user_data = req->cqe.user_data;
+	if (zc->flags & IORING_SEND_NOTIF_USER_DATA)
+		notif->cqe.user_data = READ_ONCE(sqe->addr3);
+	else {
+		if (unlikely(READ_ONCE(sqe->addr3)))
+			return -EINVAL;
+		notif->cqe.user_data = req->cqe.user_data;
+	}
  	notif->cqe.res = 0;
  	notif->cqe.flags = IORING_CQE_F_NOTIF;
  	req->flags |= REQ_F_NEED_CLEANUP;
Pavel Begunkov Oct. 21, 2022, 11:20 a.m. UTC | #11
On 10/21/22 10:45, Stefan Metzmacher wrote:
> Am 21.10.22 um 11:27 schrieb Pavel Begunkov:
>> On 10/21/22 09:32, Stefan Metzmacher wrote:
>>> Hi Pavel,
>>>
>>>>>>> Experimenting with this stuff lets me wish to have a way to
>>>>>>> have a different 'user_data' field for the notif cqe,
>>>>>>> maybe based on a IORING_RECVSEND_ flag, it may make my life
>>>>>>> easier and would avoid some complexity in userspace...
>>>>>>> As I need to handle retry on short writes even with MSG_WAITALL
>>>>>>> as EINTR and other errors could cause them.
>>>>>>>
>>>>>>> What do you think?
>>>>>
>>>>> Any comment on this?
>>>>>
>>>>> IORING_SEND_NOTIF_USER_DATA could let us use
>>>>> notif->cqe.user_data = sqe->addr3;
>>>>
>>>> I'd rather not use the last available u64, tbh, that was the
>>>> reason for not adding a second user_data in the first place.
>>>
>>> As far as I can see io_send_zc_prep has this:
>>>
>>>          if (unlikely(READ_ONCE(sqe->__pad2[0]) || READ_ONCE(sqe->addr3)))
>>>                  return -EINVAL;
>>>
>>> both are u64...
>>
>> Hah, true, completely forgot about that one
> 
> So would a commit like below be fine for you?
> 
> Do you have anything in mind for SEND[MSG]_ZC that could possibly use
> another u64 in future?

It'll most likely be taken in the future, some features are planned
some I can imagine. The question is how necessary this one is and/or
how much simpler it would make it considering that CQEs are ordered
and apps still need to check for F_MORE. It shouldn't even require
refcounting. Can you elaborate on the simplifying userspace part?
Stefan Metzmacher Oct. 21, 2022, 12:10 p.m. UTC | #12
Am 21.10.22 um 13:20 schrieb Pavel Begunkov:
> On 10/21/22 10:45, Stefan Metzmacher wrote:
>> Am 21.10.22 um 11:27 schrieb Pavel Begunkov:
>>> On 10/21/22 09:32, Stefan Metzmacher wrote:
>>>> Hi Pavel,
>>>>
>>>>>>>> Experimenting with this stuff lets me wish to have a way to
>>>>>>>> have a different 'user_data' field for the notif cqe,
>>>>>>>> maybe based on a IORING_RECVSEND_ flag, it may make my life
>>>>>>>> easier and would avoid some complexity in userspace...
>>>>>>>> As I need to handle retry on short writes even with MSG_WAITALL
>>>>>>>> as EINTR and other errors could cause them.
>>>>>>>>
>>>>>>>> What do you think?
>>>>>>
>>>>>> Any comment on this?
>>>>>>
>>>>>> IORING_SEND_NOTIF_USER_DATA could let us use
>>>>>> notif->cqe.user_data = sqe->addr3;
>>>>>
>>>>> I'd rather not use the last available u64, tbh, that was the
>>>>> reason for not adding a second user_data in the first place.
>>>>
>>>> As far as I can see io_send_zc_prep has this:
>>>>
>>>>          if (unlikely(READ_ONCE(sqe->__pad2[0]) || READ_ONCE(sqe->addr3)))
>>>>                  return -EINVAL;
>>>>
>>>> both are u64...
>>>
>>> Hah, true, completely forgot about that one
>>
>> So would a commit like below be fine for you?
>>
>> Do you have anything in mind for SEND[MSG]_ZC that could possibly use
>> another u64 in future?
> 
> It'll most likely be taken in the future, some features are planned
> some I can imagine.

Can give examples? As I can't imagine any possible feature.

> The question is how necessary this one is and/or
> how much simpler it would make it considering that CQEs are ordered
> and apps still need to check for F_MORE. It shouldn't even require
> refcounting. Can you elaborate on the simplifying userspace part?
> 
It's not critical, it would just make it easier to dispatch
a different callback functions for the two cases.

The current problem I'm facing is that I have a structure
holding the state of an response and that has a single embedded
completion structure:

(simplified) struct completion {
    uint32_t generation;
    void (*callback_fn)(void *callback_private, const struct io_uring_cqe *cqe);
    void *callback_private;
};

I use the memory address of the completion structure glued with the lower bits of the generation
as 'user_data'. Imagine that I got a short write from SENDMSG_ZC/WAITALL
because EINTR was generated, then I need to retry from userspace, which
I'd try immediately without waiting for the NOTIF cqe to arrive.

For each incoming cqe I get the completion address and the generation
out of user_data and then verify the generation against the one stored in
the completion in order to detect bugs, before passing over to callback_fn().

Because I still need to handle the NOTIF cqe from the first try
I can't change the generation for the next try.

I thought about using two completion structures, one for the main SENDMSG_ZC result
(which gets its generation incremented with each retry) and one for the NOTIF cqes
just keeping generation stable having a simple callback_fn just waiting for a
refcount to get 0.

Most likely I just need to sit down concentrated to get the
recounting and similar things sorted out.

If there are really useful things we will do with addr3 and __pad2[0],
I can try to cope with it... It would just be sad if they wouldn't be used anyway.

metze
diff mbox series

Patch

diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 04729989e6ee..efeab6a9b4f3 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -341,6 +341,7 @@  struct io_uring_cqe {
  #define IORING_CQE_F_MORE		(1U << 1)
  #define IORING_CQE_F_SOCK_NONEMPTY	(1U << 2)
  #define IORING_CQE_F_NOTIF		(1U << 3)
+#define IORING_CQE_F_COPIED		(1U << 4)

  enum {
  	IORING_CQE_BUFFER_SHIFT		= 16,
diff --git a/io_uring/notif.c b/io_uring/notif.c
index e37c6569d82e..2162d1af0b60 100644
--- a/io_uring/notif.c
+++ b/io_uring/notif.c
@@ -18,6 +18,8 @@  static void __io_notif_complete_tw(struct io_kiocb *notif, bool *locked)
  		__io_unaccount_mem(ctx->user, nd->account_pages);
  		nd->account_pages = 0;
  	}
+	if (!nd->uarg.zerocopy)
+		notif->cqe.flags |= IORING_CQE_F_COPIED;
  	io_req_task_complete(notif, locked);
  }

@@ -28,6 +30,8 @@  static void io_uring_tx_zerocopy_callback(struct sk_buff *skb,
  	struct io_notif_data *nd = container_of(uarg, struct io_notif_data, uarg);
  	struct io_kiocb *notif = cmd_to_io_kiocb(nd);

+	uarg->zerocopy = uarg->zerocopy & success;
+
  	if (refcount_dec_and_test(&uarg->refcnt)) {
  		notif->io_task_work.func = __io_notif_complete_tw;
  		io_req_task_work_add(notif);
@@ -53,6 +57,7 @@  struct io_kiocb *io_alloc_notif(struct io_ring_ctx *ctx)

  	nd = io_notif_to_data(notif);
  	nd->account_pages = 0;
+	nd->uarg.zerocopy = 1;
  	nd->uarg.flags = SKBFL_ZEROCOPY_FRAG | SKBFL_DONT_ORPHAN;
  	nd->uarg.callback = io_uring_tx_zerocopy_callback;
  	refcount_set(&nd->uarg.refcnt, 1);
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 04e2034f2f8e..64d263a8ece8 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1032,7 +1032,8 @@  static int __ip_append_data(struct sock *sk,
  				paged = true;
  				zc = true;
  				uarg = msg->msg_ubuf;
-			}
+			} else
+				msg->msg_ubuf->zerocopy = 0;
  		} else if (sock_flag(sk, SOCK_ZEROCOPY)) {
  			uarg = msg_zerocopy_realloc(sk, length, skb_zcopy(skb));
  			if (!uarg)
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index cdf26724d7db..d3a2ed9f22df 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1247,6 +1247,8 @@  int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
  			uarg = msg->msg_ubuf;
  			net_zcopy_get(uarg);
  			zc = sk->sk_route_caps & NETIF_F_SG;
+			if (!zc)
+				uarg->zerocopy = 0;
  		} else if (sock_flag(sk, SOCK_ZEROCOPY)) {
  			uarg = msg_zerocopy_realloc(sk, size, skb_zcopy(skb));
  			if (!uarg) {
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index bb0f469a5247..3d75dd05ff98 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1559,7 +1559,8 @@  static int __ip6_append_data(struct sock *sk,
  				paged = true;
  				zc = true;
  				uarg = msg->msg_ubuf;
-			}
+			} else
+				msg->msg_ubuf->zerocopy = 0;
  		} else if (sock_flag(sk, SOCK_ZEROCOPY)) {
  			uarg = msg_zerocopy_realloc(sk, length, skb_zcopy(skb));
  			if (!uarg)