[RFC,net-next,v3,23/29] io_uring: allow to pass addr into sendzc

Message ID 228d4841af5eeb9a4b73955136559f18cb7e43a0.1653992701.git.asml.silence@gmail.com (mailing list archive)
State Superseded
Delegated to: Netdev Maintainers
Series io_uring zerocopy send

Checks

Context Check Description
netdev/tree_selection success Clearly marked for net-next, async
netdev/apply fail Patch does not apply to net-next

Commit Message

Pavel Begunkov June 28, 2022, 6:56 p.m. UTC
Allow specifying an address for zerocopy sends, making them more like
sendto(2).

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 fs/io_uring.c                 | 16 +++++++++++++++-
 include/uapi/linux/io_uring.h |  2 +-
 2 files changed, 16 insertions(+), 2 deletions(-)
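
For illustration only (not part of the patch): a minimal sketch of how a
submission might fill the two new fields, assuming raw SQE access via
liburing's io_uring_get_sqe()/io_uring_prep_rw(). "ring", "sockfd", "buf"
and "buf_len" are placeholders, and the notification-slot setup required by
the rest of the series is omitted:

	struct sockaddr_in dst = {
		.sin_family      = AF_INET,
		.sin_port        = htons(4242),            /* example port */
		.sin_addr.s_addr = htonl(INADDR_LOOPBACK), /* example address */
	};
	struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

	/* opcode, fd, buffer and length, as for a plain send */
	io_uring_prep_rw(IORING_OP_SENDZC, sqe, sockfd, buf, buf_len, 0);
	/* the new sendto(2)-like destination added by this patch */
	sqe->addr2    = (unsigned long)&dst;
	sqe->addr_len = sizeof(dst);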

Comments

Stefan Metzmacher June 29, 2022, 7:42 a.m. UTC | #1
Hi Pavel,

>   
> +	if (zc->addr) {
> +		ret = move_addr_to_kernel(zc->addr, zc->addr_len, &address);
> +		if (unlikely(ret < 0))
> +			return ret;
> +		msg.msg_name = (struct sockaddr *)&address;
> +		msg.msg_namelen = zc->addr_len;
> +	}
> +

Given that this fills in msg almost completely, can we also have
a version of SENDMSGZC? It would be very useful to also allow
msg_control to be passed, as well as an iovec.

Would that be possible?

Do I understand it correctly that the reason for the new opcode is
that IORING_OP_SEND would already work with the existing MSG_ZEROCOPY
behavior, together with the recvmsg-based completion?
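
For reference, the existing MSG_ZEROCOPY completion flow being referred to
looks roughly like this (cf. Documentation/networking/msg_zerocopy.rst;
error handling omitted):

	int one = 1;
	char cbuf[CMSG_SPACE(sizeof(struct sock_extended_err))];
	struct msghdr msg = {
		.msg_control    = cbuf,
		.msg_controllen = sizeof(cbuf),
	};

	setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one));
	send(fd, buf, len, MSG_ZEROCOPY);

	/* completions arrive on the socket error queue */
	recvmsg(fd, &msg, MSG_ERRQUEUE);
	/* the cmsg payload is a struct sock_extended_err with
	 * ee_origin == SO_EE_ORIGIN_ZEROCOPY; ee_info..ee_data is the range
	 * of completed sends, and SO_EE_CODE_ZEROCOPY_COPIED in ee_code
	 * means the kernel fell back to copying. */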

In addition, I'm wondering if a completion based on msg_iocb->ki_complete() (indicated by EIOCBQUEUED)
would have also worked, just deferring the whole sendmsg operation until all buffers are no longer used.
That way the buffers would already be acked by the remote end by the time the completion comes back
to the application layer.

I'm also wondering if the ki_complete() based approach should always be provided to sock_sendmsg()
calls triggered by io_uring (independent of the new zerocopy stuff); it would basically work very similarly to
the uring_cmd() completions, which are able to handle both true async operation indicated by EIOCBQUEUED
and the EAGAIN-triggered path via io-wq.

metze
Pavel Begunkov June 29, 2022, 9:53 a.m. UTC | #2
On 6/29/22 08:42, Stefan Metzmacher wrote:
> 
> Hi Pavel,
> 
>> +    if (zc->addr) {
>> +        ret = move_addr_to_kernel(zc->addr, zc->addr_len, &address);
>> +        if (unlikely(ret < 0))
>> +            return ret;
>> +        msg.msg_name = (struct sockaddr *)&address;
>> +        msg.msg_namelen = zc->addr_len;
>> +    }
>> +
> 
> Given that this fills in msg almost completely, can we also have
> a version of SENDMSGZC? It would be very useful to also allow
> msg_control to be passed, as well as an iovec.
> 
> Would that be possible?

Right, I left it to follow ups as the series is already too long.

fwiw, I'm going to also add addr to IORING_OP_SEND.


> Do I understand it correctly that the reason for the new opcode is
> that IORING_OP_SEND would already work with the existing MSG_ZEROCOPY
> behavior, together with the recvmsg-based completion?

Right, it should work with MSG_ZEROCOPY, but with different notification
semantics: it would need recvmsg from error queues, and it comes with
performance implications.


> In addition, I'm wondering if a completion based on msg_iocb->ki_complete() (indicated by EIOCBQUEUED)
> would have also worked, just deferring the whole sendmsg operation until all buffers are no longer used.
> That way the buffers would already be acked by the remote end by the time the completion comes back
> to the application layer.

There is msg_iocb, but it's mostly unused by protocols, IIRC apart
from crypto sockets. And then we'd need to replicate the ubuf_info
path to handle stuff like skb splitting, and perhaps also change the
rules for ->ki_complete.


> I'm also wondering if the ki_complete() based approach should always be provided to sock_sendmsg()
> calls triggered by io_uring (independent of the new zerocopy stuff); it would basically work very similarly to
> the uring_cmd() completions, which are able to handle both true async operation indicated by EIOCBQUEUED
> and the EAGAIN-triggered path via io-wq.

Would be even more similar to how we have always been doing
read/write, and rw requests do pass in a msg_iocb, but again,
it's largely ignored internally.
Stefan Metzmacher Aug. 13, 2022, 8:45 a.m. UTC | #3
Hi Pavel,

>> Given that this fills in msg almost completely, can we also have
>> a version of SENDMSGZC? It would be very useful to also allow
>> msg_control to be passed, as well as an iovec.
>>
>> Would that be possible?
> 
> Right, I left it to follow ups as the series is already too long.
> 
> fwiw, I'm going to also add addr to IORING_OP_SEND.


Given the minimal differences which were left between
IORING_OP_SENDZC and IORING_OP_SEND, wouldn't it be better
to merge things into IORING_OP_SEND, using an IORING_RECVSEND_ZC_NOTIF
flag as the indication to use the notif slot?

It would mean we don't need to waste two opcodes on
IORING_OP_SENDZC and IORING_OP_SENDMSGZC (and maybe more).


I also noticed a problem in io_notif_update()

         for (; idx < idx_end; idx++) {
                 struct io_notif_slot *slot = &ctx->notif_slots[idx];

                 if (!slot->notif)
                         continue;
                 if (up->arg)
                         slot->tag = up->arg;
                 io_notif_slot_flush_submit(slot, issue_flags);
         }

  slot->tag = up->arg is skipped if there is no notif already.

So you can't just use two linked SQEs with

IORING_RSRC_UPDATE_NOTIF followed by IORING_OP_SENDZC (with IORING_RECVSEND_NOTIF_FLUSH).

I think the if (!slot->notif) should be moved down a bit.

It would somehow be nice to avoid the notif slots entirely and somehow
use some kind of multishot request in order to generate two CQEs.

I'm also wondering what will happen if a notif is still referenced by the net layer
but the io_uring instance is already closed; wouldn't
io_uring_tx_zerocopy_callback() or __io_notif_complete_tw() crash
because notif->ctx is a stale pointer, or notif itself is already gone...

What do you think?

metze
Pavel Begunkov Aug. 15, 2022, 9:46 a.m. UTC | #4
On 8/13/22 09:45, Stefan Metzmacher wrote:
> Hi Pavel,

Hi Stefan,

Thanks for giving the API some thought, are you trying
to use it in samba?

>>> Given that this fills in msg almost completely, can we also have
>>> a version of SENDMSGZC? It would be very useful to also allow
>>> msg_control to be passed, as well as an iovec.
>>>
>>> Would that be possible?
>>
>> Right, I left it to follow ups as the series is already too long.
>>
>> fwiw, I'm going to also add addr to IORING_OP_SEND.
> 
> 
> Given the minimal differences which were left between
> IORING_OP_SENDZC and IORING_OP_SEND, wouldn't it be better
> to merge things into IORING_OP_SEND, using an IORING_RECVSEND_ZC_NOTIF
> flag as the indication to use the notif slot?

And they will be even more similar in for-next, but with notifications
I'd still prefer different opcodes, to get a little bit more
flexibility and to avoid making the normal io_uring send path messier.

> It would mean we don't need to waste two opcodes on
> IORING_OP_SENDZC and IORING_OP_SENDMSGZC (and maybe more).
> 
> 
> I also noticed a problem in io_notif_update()
> 
>          for (; idx < idx_end; idx++) {
>                  struct io_notif_slot *slot = &ctx->notif_slots[idx];
> 
>                  if (!slot->notif)
>                          continue;
>                  if (up->arg)
>                          slot->tag = up->arg;
>                  io_notif_slot_flush_submit(slot, issue_flags);
>          }
> 
>   slot->tag = up->arg is skipped if there is no notif already.
> 
> So you can't just use two linked SQEs with
>
> IORING_RSRC_UPDATE_NOTIF followed by IORING_OP_SENDZC (with IORING_RECVSEND_NOTIF_FLUSH).

slot->notif is lazily initialised with the first send attached to it,
so in your example IORING_OP_SENDZC will first create a notification
in order to execute the send, and then will flush it.

This "if" is there is only to have a more reliable API. We can
go over the range and allocate all empty slots and then flush
all of them, but allocation failures should be propagated to the
userspace when currently the function it can't fail.

> I think the if (!slot->notif) should be moved down a bit.

Not sure what you mean

> It would somehow be nice to avoid the notif slots entirely and somehow
> use some kind of multishot request in order to generate two CQEs.

It is there first to amortise the overhead of the zerocopy infra and the
bits for the second CQE posting. But more importantly, without it, for TCP
the send payload size would need to be large enough or performance
would suffer, but it all depends on the use case. TL;DR: the kernel would be
forced to create a new SKB for each new send.
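
As a rough sketch of what that amortisation looks like from the submission
side, using the slot-based API of this series ("ring", "sockfd" and "iov"
are placeholders, and where exactly the flush flag lives in the SQE is left
to the later patches that add it):

	/* attach several small sends to notification slot 0; they share one
	 * notification (one ubuf_info), so TCP can keep appending to the
	 * same zerocopy skb instead of allocating a new one per send */
	for (int i = 0; i < n; i++) {
		struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

		io_uring_prep_rw(IORING_OP_SENDZC, sqe, sockfd,
				 iov[i].iov_base, iov[i].iov_len, 0);
		sqe->notification_idx = 0;
	}
	/* afterwards the slot is flushed, either by issuing the last send
	 * with IORING_RECVSEND_NOTIF_FLUSH or via IORING_RSRC_UPDATE_NOTIF,
	 * which is what posts the single notification CQE for the batch */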

For something simpler, I'll push another zc variant that doesn't
have notifiers and posts only one CQE, and only after the buffers
are no longer in use by the kernel. This works well for UDP and for
some TCP scenarios, but doesn't cover all cases.

> I'm also wondering what will happen if a notif is still referenced by the net layer
> but the io_uring instance is already closed; wouldn't
> io_uring_tx_zerocopy_callback() or __io_notif_complete_tw() crash
> because notif->ctx is a stale pointer, or notif itself is already gone...

io_uring will flush all slots and wait for all notifications
to fire, i.e. io_uring_tx_zerocopy_callback(), so it's not a
problem.
Stefan Metzmacher Aug. 15, 2022, 11:40 a.m. UTC | #5
Hi Pavel,

> Thanks for giving the API some thought, are you trying
> to use it in samba?

Yes, but I'd need SENDMSGZC, and then I'd like to test
which variant gives the best performance. It also depends
on the configured samba vfs module stack.

My current prototype uses IO_SENDMSG for the header (< 250 bytes),
followed by up to 8 MBytes via IO_SPLICE if the storage backend also
supports splice; otherwise I'd try to use IO_SENDMSGZC for the header + 8 MBytes
of payload together. If there's encryption turned on for the connection we would
most likely always use a bounce buffer and hit the IO_SENDMSGZC case.
So all in all I'd say we'll use it.

I guess it would be useful for userspace to notice whether zerocopy was possible or not.

__msg_zerocopy_callback() sets SO_EE_CODE_ZEROCOPY_COPIED, maybe
io_uring_tx_zerocopy_callback() should have something like:

if (!success)
     notif->cqe.res = SO_EE_CODE_ZEROCOPY_COPIED;

This would make it a bit easier to judge if SENDZC is useful for the
application or not. Or at least have a debug message, which would be
able to explain degraded performance to the admin/developer.

>>>> Given that this fills in msg almost completely, can we also have
>>>> a version of SENDMSGZC? It would be very useful to also allow
>>>> msg_control to be passed, as well as an iovec.
>>>>
>>>> Would that be possible?
>>>
>>> Right, I left it to follow ups as the series is already too long.
>>>
>>> fwiw, I'm going to also add addr to IORING_OP_SEND.
>>
>>
>> Given the minimal differences which were left between
>> IORING_OP_SENDZC and IORING_OP_SEND, wouldn't it be better
>> to merge things into IORING_OP_SEND, using an IORING_RECVSEND_ZC_NOTIF
>> flag as the indication to use the notif slot?
> 
> And they will be even more similar in for-next, but with notifications
> I'd still prefer different opcodes, to get a little bit more
> flexibility and to avoid making the normal io_uring send path messier.

Ok, we should just remember that the opcode is only a u8
and we already have ~50 out of ~250 allocated in ~3 years'
time.

>> It would mean we don't need to waste two opcodes on
>> IORING_OP_SENDZC and IORING_OP_SENDMSGZC (and maybe more).
>>
>>
>> I also noticed a problem in io_notif_update()
>>
>>          for (; idx < idx_end; idx++) {
>>                  struct io_notif_slot *slot = &ctx->notif_slots[idx];
>>
>>                  if (!slot->notif)
>>                          continue;
>>                  if (up->arg)
>>                          slot->tag = up->arg;
>>                  io_notif_slot_flush_submit(slot, issue_flags);
>>          }
>>
>>   slot->tag = up->arg is skipped if there is no notif already.
>>
>> So you can't just use two linked SQEs with
>>
>> IORING_RSRC_UPDATE_NOTIF followed by IORING_OP_SENDZC (with IORING_RECVSEND_NOTIF_FLUSH).
> 
>> slot->notif is lazily initialised with the first send attached to it,
>> so in your example IORING_OP_SENDZC will first create a notification
>> in order to execute the send, and then will flush it.
> 
> This "if" is there is only to have a more reliable API. We can
> go over the range and allocate all empty slots and then flush
> all of them, but allocation failures should be propagated to the
> userspace when currently the function it can't fail.
> 
>> I think the if (!slot->notif) should be moved down a bit.
> 
> Not sure what you mean

I think it should be:

                   if (up->arg)
                           slot->tag = up->arg;
                   if (!slot->notif)
                           continue;
                   io_notif_slot_flush_submit(slot, issue_flags);

or even:

                   slot->tag = up->arg;
                   if (!slot->notif)
                           continue;
                   io_notif_slot_flush_submit(slot, issue_flags);

otherwise IORING_RSRC_UPDATE_NOTIF would not be able to reset the tag
if the notif was never created or was already flushed.

>> It would somehow be nice to avoid the notif slots entirely and somehow
>> use some kind of multishot request in order to generate two CQEs.
> 
>> It is there first to amortise the overhead of the zerocopy infra and the
>> bits for the second CQE posting. But more importantly, without it, for TCP
>> the send payload size would need to be large enough or performance
>> would suffer, but it all depends on the use case. TL;DR: the kernel would be
>> forced to create a new SKB for each new send.
> 
>> For something simpler, I'll push another zc variant that doesn't
>> have notifiers and posts only one CQE, and only after the buffers
>> are no longer in use by the kernel. This works well for UDP and for
>> some TCP scenarios, but doesn't cover all cases.

I think (at least for stream sockets) it would be more useful to
get two CQEs:
1. The first signals userspace that it can
    issue the next send-like operation (SEND, SENDZC, SENDMSG, SPLICE)
    on the stream without the risk of byte-ordering problems within the stream,
    and avoids too high latency (which would happen if we waited for a send to
    leave the hardware NIC before sending the next PDU).
2. The 2nd signals userspace that the buffer can be reused or released.

In that case it would be useful to also provide a separate 'user_data' element
for the 2nd CQE.

>> I'm also wondering what will happen if a notif is still referenced by the net layer
>> but the io_uring instance is already closed; wouldn't
>> io_uring_tx_zerocopy_callback() or __io_notif_complete_tw() crash
>> because notif->ctx is a stale pointer, or notif itself is already gone...
> 
> io_uring will flush all slots and wait for all notifications
> to fire, i.e. io_uring_tx_zerocopy_callback(), so it's not a
> problem.

I can't follow :-(

What I see is that io_notif_unregister():

                 nd = io_notif_to_data(notif);
                 slot->notif = NULL;
                 if (!refcount_dec_and_test(&nd->uarg.refcnt))
                         continue;

So if the net layer still has a reference we just go on.

Only a wild guess, is it something like:

io_alloc_notif():
         ...
         notif->task = current;
         io_get_task_refs(1);
         notif->rsrc_node = NULL;
         io_req_set_rsrc_node(notif, ctx, 0);
         ...

and

__io_req_complete_put():
                 ...
                 io_req_put_rsrc(req);
                 /*
                  * Selected buffer deallocation in io_clean_op() assumes that
                  * we don't hold ->completion_lock. Clean them here to avoid
                  * deadlocks.
                  */
                 io_put_kbuf_comp(req);
                 io_dismantle_req(req);
                 io_put_task(req->task, 1);
                 ...

that causes io_ring_exit_work() to wait for it.
It would be great if you or someone else could explain this in detail
and maybe add some comments to the code.

metze
Pavel Begunkov Aug. 15, 2022, 12:19 p.m. UTC | #6
On 8/15/22 12:40, Stefan Metzmacher wrote:
> Hi Pavel,
> 
>> Thanks for giving the API some thought, are you trying
>> to use it in samba?
> 
> Yes, but I'd need SENDMSGZC, and then I'd like to test
> which variant gives the best performance. It also depends
> on the configured samba vfs module stack.

I can send you a branch this week if you would be
willing to try it out, as I'll be sending the "msg" variant
only for 5.21.

> My current prototype uses IO_SENDMSG for the header (< 250 bytes),
> followed by up to 8 MBytes via IO_SPLICE if the storage backend also
> supports splice; otherwise I'd try to use IO_SENDMSGZC for the header + 8 MBytes
> of payload together. If there's encryption turned on for the connection we would
> most likely always use a bounce buffer and hit the IO_SENDMSGZC case.
> So all in all I'd say we'll use it.

Perfect

> I guess it would be useful for userspace to notice whether zerocopy was possible or not.
> 
> __msg_zerocopy_callback() sets SO_EE_CODE_ZEROCOPY_COPIED, maybe
> io_uring_tx_zerocopy_callback() should have something like:
> 
> if (!success)
>      notif->cqe.res = SO_EE_CODE_ZEROCOPY_COPIED;
> 
> This would make it a bit easier to judge if SENDZC is useful for the
> application or not. Or at least have a debug message, which would be
> able to explain degraded performance to the admin/developer.

Ok, let me think about it


>>>>> Given that this fills in msg almost completely, can we also have
>>>>> a version of SENDMSGZC? It would be very useful to also allow
>>>>> msg_control to be passed, as well as an iovec.
>>>>>
>>>>> Would that be possible?
>>>>
>>>> Right, I left it to follow ups as the series is already too long.
>>>>
>>>> fwiw, I'm going to also add addr to IORING_OP_SEND.
>>>
>>>
>>> Given the minimal differences which were left between
>>> IORING_OP_SENDZC and IORING_OP_SEND, wouldn't it be better
>>> to merge things into IORING_OP_SEND, using an IORING_RECVSEND_ZC_NOTIF
>>> flag as the indication to use the notif slot?
>>
>> And they will be even more similar in for-next, but with notifications
>> I'd still prefer different opcodes, to get a little bit more
>> flexibility and to avoid making the normal io_uring send path messier.
> 
> Ok, we should just remember that the opcode is only a u8
> and we already have ~50 out of ~250 allocated in ~3 years'
> time.
> 
>>> It would mean we don't need to waste two opcodes on
>>> IORING_OP_SENDZC and IORING_OP_SENDMSGZC (and maybe more).
>>>
>>>
>>> I also noticed a problem in io_notif_update()
>>>
>>>          for (; idx < idx_end; idx++) {
>>>                  struct io_notif_slot *slot = &ctx->notif_slots[idx];
>>>
>>>                  if (!slot->notif)
>>>                          continue;
>>>                  if (up->arg)
>>>                          slot->tag = up->arg;
>>>                  io_notif_slot_flush_submit(slot, issue_flags);
>>>          }
>>>
>>>   slot->tag = up->arg is skipped if there is no notif already.
>>>
>>> So you can't just use two linked SQEs with
>>>
>>> IORING_RSRC_UPDATE_NOTIF followed by IORING_OP_SENDZC (with IORING_RECVSEND_NOTIF_FLUSH).
>>
>> slot->notif is lazily initialised with the first send attached to it,
>> so in your example IORING_OP_SENDZC will first create a notification
>> in order to execute the send, and then will flush it.
>>
>> This "if" is there is only to have a more reliable API. We can
>> go over the range and allocate all empty slots and then flush
>> all of them, but allocation failures should be propagated to the
>> userspace when currently the function it can't fail.
>>
>>> I think the if (!slot->notif) should be moved down a bit.
>>
>> Not sure what you mean
> 
> I think it should be:
> 
>                    if (up->arg)
>                            slot->tag = up->arg;
>                    if (!slot->notif)
>                            continue;
>                    io_notif_slot_flush_submit(slot, issue_flags);
> 
> or even:
> 
>                    slot->tag = up->arg;
>                    if (!slot->notif)
>                            continue;
>                    io_notif_slot_flush_submit(slot, issue_flags);
> 
> otherwise IORING_RSRC_UPDATE_NOTIF would not be able to reset the tag
> if the notif was never created or was already flushed.

Ah, you want to update it for later. The idea was to affect only
those notifiers that are flushed by this update.
...

>>> It would somehow be nice to avoid the notif slots entirely and somehow
>>> use some kind of multishot request in order to generate two CQEs.
>>
>> It is there first to amortise the overhead of the zerocopy infra and the
>> bits for the second CQE posting. But more importantly, without it, for TCP
>> the send payload size would need to be large enough or performance
>> would suffer, but it all depends on the use case. TL;DR: the kernel would be
>> forced to create a new SKB for each new send.
>>
>> For something simpler, I'll push another zc variant that doesn't
>> have notifiers and posts only one CQE, and only after the buffers
>> are no longer in use by the kernel. This works well for UDP and for
>> some TCP scenarios, but doesn't cover all cases.
> 
> I think (at least for stream sockets) it would be more useful to
> get two CQEs:
> 1. The first signals userspace that it can
>     issue the next send-like operation (SEND, SENDZC, SENDMSG, SPLICE)
>     on the stream without the risk of byte-ordering problems within the stream,
>     and avoids too high latency (which would happen if we waited for a send to
>     leave the hardware NIC before sending the next PDU).
> 2. The 2nd signals userspace that the buffer can be reused or released.
>
> In that case it would be useful to also provide a separate 'user_data' element
> for the 2nd CQE.

...

I had a similar chat with Dylan last week. I'd rather not rob the SQE of
an additional u64 as there is only addr3 left and then we're fully packed,
but there is another option we were thinking about, based on the OVERRIDE_TAG
feature I scrapped from the final version of the zerocopy patches.

Long story short, the idea is to copy req->cqe.user_data of a
send(+flush) request into the notification CQE, so you'll get 2 CQEs
with identical user_data but they can be distinguished by looking at
cqe->flags.

What do you think? Would it work for you?
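
A rough sketch of what consuming that scheme could look like, assuming some
flag bit (called IORING_CQE_F_NOTIF here) is reserved to mark the
notification CQE, with buffer_done()/send_done() as placeholder application
callbacks:

	struct io_uring_cqe *cqe;

	io_uring_wait_cqe(&ring, &cqe);
	if (cqe->flags & IORING_CQE_F_NOTIF) {
		/* second CQE: the kernel no longer references the buffer */
		buffer_done(cqe->user_data);
	} else {
		/* first CQE: the result of the send itself */
		send_done(cqe->user_data, cqe->res);
	}
	io_uring_cqe_seen(&ring, cqe);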


>>> I'm also wondering what will happen if a notif is still referenced by the net layer
>>> but the io_uring instance is already closed; wouldn't
>>> io_uring_tx_zerocopy_callback() or __io_notif_complete_tw() crash
>>> because notif->ctx is a stale pointer, or notif itself is already gone...
>>
>> io_uring will flush all slots and wait for all notifications
>> to fire, i.e. io_uring_tx_zerocopy_callback(), so it's not a
>> problem.
> 
> I can't follow :-(
> 
> What I see is that io_notif_unregister():
> 
>                  nd = io_notif_to_data(notif);
>                  slot->notif = NULL;
>                  if (!refcount_dec_and_test(&nd->uarg.refcnt))
>                          continue;
> 
> So if the net layer still has a reference we just go on.
> 
> Only a wild guess, is it something like:
> 
> io_alloc_notif():
>          ...
>          notif->task = current;
>          io_get_task_refs(1);
>          notif->rsrc_node = NULL;
>          io_req_set_rsrc_node(notif, ctx, 0);
>          ...
> 
> and
> 
> __io_req_complete_put():
>                  ...
>                  io_req_put_rsrc(req);
>                  /*
>                   * Selected buffer deallocation in io_clean_op() assumes that
>                   * we don't hold ->completion_lock. Clean them here to avoid
>                   * deadlocks.
>                   */
>                  io_put_kbuf_comp(req);
>                  io_dismantle_req(req);
>                  io_put_task(req->task, 1);
>                  ...
> 
> that causes io_ring_exit_work() to wait for it.
> It would be great if you or someone else could explain this in detail
> and maybe add some comments to the code.

Almost, the mechanism is absolutely the same as with requests,
and notifiers are actually requests for internal purposes.

In __io_alloc_req_refill() we grab ctx->refs, which are waited
for in io_ring_exit_work(). We usually put requests into a cache,
so when a request is complete we don't put the ref and therefore
in io_ring_exit_work() we also have a call to io_req_caches_free(),
which puts ctx->refs.
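
Or, as a rough summary of that lifetime rule in pseudocode (not verbatim
kernel code):

	__io_alloc_req_refill():
		percpu_ref_get_many(&ctx->refs, ...);  /* each request pins the ctx */

	/* on completion the request goes back into a cache, the ref is kept */

	io_ring_exit_work():
		io_req_caches_free();  /* drops the refs held by cached requests */
		/* then waits for ctx->refs to drop to zero, which includes the
		 * refs held by notifications still in flight in the net layer */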
Stefan Metzmacher Aug. 15, 2022, 1:30 p.m. UTC | #7
Hi Pavel,

>>> Thanks for giving the API some thought, are you trying
>>> to use it in samba?
>>
>> Yes, but I'd need SENDMSGZC, and then I'd like to test
>> which variant gives the best performance. It also depends
>> on the configured samba vfs module stack.
> 
> I can send you a branch this week if you would be
> willing to try it out, as I'll be sending the "msg" variant
> only for 5.21.

I'm not sure I'll have time to do runtime testing,
but it would be great to have a look at the code and give some comments
based on that.

>> I think it should be:
>>
>>                    if (up->arg)
>>                            slot->tag = up->arg;
>>                    if (!slot->notif)
>>                            continue;
>>                    io_notif_slot_flush_submit(slot, issue_flags);
>>
>> or even:
>>
>>                    slot->tag = up->arg;
>>                    if (!slot->notif)
>>                            continue;
>>                    io_notif_slot_flush_submit(slot, issue_flags);
>>
>> otherwise IORING_RSRC_UPDATE_NOTIF would not be able to reset the tag
>> if the notif was never created or was already flushed.
> 
> Ah, you want to update it for later. The idea was to affect only
> those notifiers that are flushed by this update.
> ...

notif->cqe.user_data = slot->tag; happens in io_alloc_notif(),
so the slot->tag = up->arg; here is always for the next IO_SENDZC.

With IORING_RSRC_UPDATE_NOTIF linked to an IORING_OP_SENDZC (with IORING_RECVSEND_NOTIF_FLUSH),
I basically try to reset slot->tag to the same (or a related) user_data as the SENDZC itself.

So that each SENDZC generates two CQEs with the same user_data belonging
to the same userspace buffer.


> I had a similar chat with Dylan last week. I'd rather not rob the SQE of
> an additional u64 as there is only addr3 left and then we're fully packed,
> but there is another option we were thinking about, based on the OVERRIDE_TAG
> feature I scrapped from the final version of the zerocopy patches.
> 
> Long story short, the idea is to copy req->cqe.user_data of a
> send(+flush) request into the notification CQE, so you'll get 2 CQEs
> with identical user_data but they can be distinguished by looking at
> cqe->flags.
> 
> What do you think? Would it work for you?

I guess that would work.

>>>> I'm also wondering what will happen if a notif is still referenced by the net layer
>>>> but the io_uring instance is already closed; wouldn't
>>>> io_uring_tx_zerocopy_callback() or __io_notif_complete_tw() crash
>>>> because notif->ctx is a stale pointer, or notif itself is already gone...
>>>
>>> io_uring will flush all slots and wait for all notifications
>>> to fire, i.e. io_uring_tx_zerocopy_callback(), so it's not a
>>> problem.
>>
>> I can't follow :-(
>>
>> What I see is that io_notif_unregister():
>>
>>                  nd = io_notif_to_data(notif);
>>                  slot->notif = NULL;
>>                  if (!refcount_dec_and_test(&nd->uarg.refcnt))
>>                          continue;
>>
>> So if the net layer still has a reference we just go on.
>>
>> Only a wild guess, is it something like:
>>
>> io_alloc_notif():
>>          ...
>>          notif->task = current;
>>          io_get_task_refs(1);
>>          notif->rsrc_node = NULL;
>>          io_req_set_rsrc_node(notif, ctx, 0);
>>          ...
>>
>> and
>>
>> __io_req_complete_put():
>>                  ...
>>                  io_req_put_rsrc(req);
>>                  /*
>>                   * Selected buffer deallocation in io_clean_op() assumes that
>>                   * we don't hold ->completion_lock. Clean them here to avoid
>>                   * deadlocks.
>>                   */
>>                  io_put_kbuf_comp(req);
>>                  io_dismantle_req(req);
>>                  io_put_task(req->task, 1);
>>                  ...
>>
>> that causes io_ring_exit_work() to wait for it.
>> It would be great if you or someone else could explain this in detail
>> and maybe add some comments to the code.
> 
> Almost, the mechanism is absolutely the same as with requests,
> and notifiers are actually requests for internal purposes.
> 
> In __io_alloc_req_refill() we grab ctx->refs, which are waited
> for in io_ring_exit_work(). We usually put requests into a cache,
> so when a request is complete we don't put the ref and therefore
> in io_ring_exit_work() we also have a call to io_req_caches_free(),
> which puts ctx->refs.

Ok, thanks.

Would a close() on the ring fd block? I guess not, but the exit_work may block, correct?
So a process would be a zombie until net released all references?

metze
Pavel Begunkov Aug. 15, 2022, 2:09 p.m. UTC | #8
On 8/15/22 14:30, Stefan Metzmacher wrote:
[...]
>>> that causes io_ring_exit_work() to wait for it.
>>> It would be great if you or someone else could explain this in detail
>>> and maybe add some comments to the code.
>>
>> Almost, the mechanism is absolutely the same as with requests,
>> and notifiers are actually requests for internal purposes.
>>
>> In __io_alloc_req_refill() we grab ctx->refs, which are waited
>> for in io_ring_exit_work(). We usually put requests into a cache,
>> so when a request is complete we don't put the ref and therefore
>> in io_ring_exit_work() we also have a call to io_req_caches_free(),
>> which puts ctx->refs.
> 
> Ok, thanks.
> 
> Would a close() on the ring fd block? I guess not, but the exit_work may block, correct?

Right, close() doesn't block; it queues exit_work to get executed async.
exit_work() will wait for ctx->refs and then will free most of the
io_uring resources.

> So a process would be a zombie until net released all references?

The userspace process will exit normally, but you can find a kernel
thread (kworker) doing the job.

Patch

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 838030477456..a1e9405a3f1b 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -722,6 +722,8 @@  struct io_sendzc {
 	size_t				len;
 	u16				slot_idx;
 	int				msg_flags;
+	int				addr_len;
+	void __user			*addr;
 };
 
 struct io_open {
@@ -6572,7 +6574,7 @@  static int io_sendzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 {
 	struct io_sendzc *zc = &req->msgzc;
 
-	if (READ_ONCE(sqe->ioprio) || READ_ONCE(sqe->addr2) || READ_ONCE(sqe->__pad2[0]))
+	if (READ_ONCE(sqe->ioprio) || READ_ONCE(sqe->__pad2[0]))
 		return -EINVAL;
 
 	zc->buf = u64_to_user_ptr(READ_ONCE(sqe->addr));
@@ -6581,6 +6583,9 @@  static int io_sendzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 	zc->slot_idx = READ_ONCE(sqe->notification_idx);
 	if (zc->msg_flags & MSG_DONTWAIT)
 		req->flags |= REQ_F_NOWAIT;
+	zc->addr = u64_to_user_ptr(READ_ONCE(sqe->addr2));
+	zc->addr_len = READ_ONCE(sqe->addr_len);
+
 #ifdef CONFIG_COMPAT
 	if (req->ctx->compat)
 		zc->msg_flags |= MSG_CMSG_COMPAT;
@@ -6590,6 +6595,7 @@  static int io_sendzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 
 static int io_sendzc(struct io_kiocb *req, unsigned int issue_flags)
 {
+	struct sockaddr_storage address;
 	struct io_ring_ctx *ctx = req->ctx;
 	struct io_sendzc *zc = &req->msgzc;
 	struct io_notif_slot *notif_slot;
@@ -6624,6 +6630,14 @@  static int io_sendzc(struct io_kiocb *req, unsigned int issue_flags)
 		return ret;
 	mm_account_pinned_pages(&notif->uarg.mmp, zc->len);
 
+	if (zc->addr) {
+		ret = move_addr_to_kernel(zc->addr, zc->addr_len, &address);
+		if (unlikely(ret < 0))
+			return ret;
+		msg.msg_name = (struct sockaddr *)&address;
+		msg.msg_namelen = zc->addr_len;
+	}
+
 	msg_flags = zc->msg_flags | MSG_ZEROCOPY;
 	if (issue_flags & IO_URING_F_NONBLOCK)
 		msg_flags |= MSG_DONTWAIT;
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 6c6f20ae5a95..689aa1444cd4 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -63,7 +63,7 @@  struct io_uring_sqe {
 		__u32	file_index;
 		struct {
 			__u16	notification_idx;
-			__u16	__pad;
+			__u16	addr_len;
 		} __attribute__((packed));
 	};
 	union {