
[6/8] io_uring/net: support multishot for send

Message ID: 20240225003941.129030-7-axboe@kernel.dk
State: New
Series: Support for provided buffers for send

Commit Message

Jens Axboe Feb. 25, 2024, 12:35 a.m. UTC
This works very much like the receive side, except for sends. The idea
is that an application can fill outgoing buffers in a provided buffer
group, and then arm a single send that will service them all. For now
this variant just terminates when we are out of buffers to send, and
hence the application needs to re-arm it if IORING_CQE_F_MORE isn't
set, as per usual for multishot requests.

This only enables it for IORING_OP_SEND; IORING_OP_SENDMSG is coming
in a separate patch. However, this patch does do a lot of the prep
work that makes wiring up the sendmsg variant pretty trivial. They
share the prep side.

Enabling multishot for sends is, again, identical to the receive side.
The app sets IORING_SEND_MULTISHOT in sqe->ioprio. This flag is also
the same as IORING_RECV_MULTISHOT.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 include/uapi/linux/io_uring.h |  8 +++
 io_uring/net.c                | 98 +++++++++++++++++++++++++++++------
 2 files changed, 89 insertions(+), 17 deletions(-)

Comments

Dylan Yudaken Feb. 26, 2024, 10:47 a.m. UTC | #1
On Sun, Feb 25, 2024 at 12:46 AM Jens Axboe <axboe@kernel.dk> wrote:
>
> This works very much like the receive side, except for sends. The idea
> is that an application can fill outgoing buffers in a provided buffer
> group, and then arm a single send that will service them all. For now
> this variant just terminates when we are out of buffers to send, and
> hence the application needs to re-arm it if IORING_CQE_F_MORE isn't
> set, as per usual for multishot requests.
>

This feels to me a lot like just using OP_SEND with MSG_WAITALL as
described, unless I'm missing something?

I actually could imagine it being useful for the previous patches' use
case of queuing up sends and keeping ordering,
and I think the API is more obvious (rather than the second CQE
sending the first CQE's data). So maybe it's worth only
keeping one approach?
Jens Axboe Feb. 26, 2024, 1:38 p.m. UTC | #2
On 2/26/24 3:47 AM, Dylan Yudaken wrote:
> On Sun, Feb 25, 2024 at 12:46 AM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> This works very much like the receive side, except for sends. The idea
>> is that an application can fill outgoing buffers in a provided buffer
>> group, and then arm a single send that will service them all. For now
>> this variant just terminates when we are out of buffers to send, and
>> hence the application needs to re-arm it if IORING_CQE_F_MORE isn't
>> set, as per usual for multishot requests.
>>
> 
> This feels to me a lot like just using OP_SEND with MSG_WAITALL as
> described, unless I'm missing something?

How so? MSG_WAITALL is "send X amount of data, and if it's a short send,
try again" where multishot is "send data from this buffer group, and
keep sending data until it's empty". Hence it's the mirror of multishot
on the receive side. Unless I'm misunderstanding you somehow, not sure
it'd be smart to add special meaning to MSG_WAITALL with provided
buffers.

> I actually could imagine it being useful for the previous patches' use
> case of queuing up sends and keeping ordering,
> and I think the API is more obvious (rather than the second CQE
> sending the first CQE's data). So maybe it's worth only
> keeping one approach?

And here you totally lost me :-)
Dylan Yudaken Feb. 26, 2024, 2:02 p.m. UTC | #3
On Mon, Feb 26, 2024 at 1:38 PM Jens Axboe <axboe@kernel.dk> wrote:
>
> On 2/26/24 3:47 AM, Dylan Yudaken wrote:
> > On Sun, Feb 25, 2024 at 12:46 AM Jens Axboe <axboe@kernel.dk> wrote:
> >>
> >> This works very much like the receive side, except for sends. The idea
> >> is that an application can fill outgoing buffers in a provided buffer
> >> group, and then arm a single send that will service them all. For now
> >> this variant just terminates when we are out of buffers to send, and
> >> hence the application needs to re-arm it if IORING_CQE_F_MORE isn't
> >> set, as per usual for multishot requests.
> >>
> >
> > This feels to me a lot like just using OP_SEND with MSG_WAITALL as
> > described, unless I'm missing something?
>
> How so? MSG_WAITALL is "send X amount of data, and if it's a short send,
> try again" where multishot is "send data from this buffer group, and
> keep sending data until it's empty". Hence it's the mirror of multishot
> on the receive side. Unless I'm misunderstanding you somehow, not sure
> it'd be smart to add special meaning to MSG_WAITALL with provided
> buffers.
>

_If_ you have the data upfront these are very similar, and only differ in that
the multishot approach will give you more granular progress updates.
My point was that this might not be a valuable API to people for only this
use case.

You do make a good point about MSG_WAITALL though - multishot send
doesn't really make sense to me without MSG_WAITALL semantics.
I cannot imagine a useful use case where the first buffer being partially sent
will still want the second buffer sent.

> > I actually could imagine it being useful for the previous patches' use
> > case of queuing up sends and keeping ordering,
> > and I think the API is more obvious (rather than the second CQE
> > sending the first CQE's data). So maybe it's worth only
> > keeping one approach?
>
> And here you totally lost me :-)

I am suggesting here that you don't really need to support buffer
lists on send without multishot.

It's a slightly confusing API (to me) that you queue PushBuffer(A),
Send(A), PushBuffer(B), Send(B)
and get back Res(B), Res(A) which are in fact in order A->B.

Instead you could queue up PushBuffer(A), Send(Multishot),
PushBuffer(B), and get back Res(Multishot), Res(Multishot)
which are in order A -> B.

The downside here is that userspace has to handle requeueing the SQE
if A completes before B is pushed. I leave it to you
if that is not desirable. I can see arguments for both sides.
Jens Axboe Feb. 26, 2024, 2:27 p.m. UTC | #4
On 2/26/24 7:02 AM, Dylan Yudaken wrote:
> On Mon, Feb 26, 2024 at 1:38 PM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> On 2/26/24 3:47 AM, Dylan Yudaken wrote:
>>> On Sun, Feb 25, 2024 at 12:46 AM Jens Axboe <axboe@kernel.dk> wrote:
>>>>
>>>> This works very much like the receive side, except for sends. The idea
>>>> is that an application can fill outgoing buffers in a provided buffer
>>>> group, and then arm a single send that will service them all. For now
>>>> this variant just terminates when we are out of buffers to send, and
>>>> hence the application needs to re-arm it if IORING_CQE_F_MORE isn't
>>>> set, as per usual for multishot requests.
>>>>
>>>
>>> This feels to me a lot like just using OP_SEND with MSG_WAITALL as
>>> described, unless I'm missing something?
>>
>> How so? MSG_WAITALL is "send X amount of data, and if it's a short send,
>> try again" where multishot is "send data from this buffer group, and
>> keep sending data until it's empty". Hence it's the mirror of multishot
>> on the receive side. Unless I'm misunderstanding you somehow, not sure
>> it'd be smart to add special meaning to MSG_WAITALL with provided
>> buffers.
>>
> 
> _If_ you have the data upfront these are very similar, and only differ
> in that the multishot approach will give you more granular progress
> updates. My point was that this might not be a valuable API to people
> for only this use case.

Not sure I agree, it feels like attributing a different meaning to
MSG_WAITALL if you use a provided buffer vs if you don't. And that to me
would seem to be confusing. Particularly when we have multishot on the
receive side, and this is identical, just for sends. Receives will keep
receiving as long as there are buffers in the provided group to receive
into, and sends will keep sending for the same condition. Either one
will terminate if we run out of buffers.

If you make MSG_WAITALL be that for provided buffers + send, then that
behaves differently than MSG_WAITALL with receive, and MSG_WAITALL with
send _without_ provided buffers. I don't think overloading an existing
flag for this purposes is a good idea, particularly when we already have
the existing semantics for multishot on the receive side.

> You do make a good point about MSG_WAITALL though - multishot send
> doesn't really make sense to me without MSG_WAITALL semantics. I
> cannot imagine a useful use case where the first buffer being
> partially sent will still want the second buffer sent.

Right, and I need to tweak that. Maybe we require MSG_WAITALL, or we
make it implied for multishot send. Currently the code doesn't deal with
that.

Maybe if MSG_WAITALL isn't set and we get a short send we don't set
CQE_F_MORE and we just stop. If it is set, then we go through the usual
retry logic. That would make it identical to MSG_WAITALL send without
multishot, which again is something I like in that we don't have
different behaviors depending on which mode we are using.

>>> I actually could imagine it being useful for the previous patches' use
>>> case of queuing up sends and keeping ordering,
>>> and I think the API is more obvious (rather than the second CQE
>>> sending the first CQE's data). So maybe it's worth only
>>> keeping one approach?
>>
>> And here you totally lost me :-)
> 
> I am suggesting here that you don't really need to support buffer
> lists on send without multishot.

That is certainly true, but I also don't see a reason _not_ to support
it. Again mostly because this is how receive and everything else works.
The app is free to issue a single SQE for send without multishot, and
pick the first buffer and send it.

> It's a slightly confusing API (to me) that you queue PushBuffer(A),
> Send(A), PushBuffer(B), Send(B)
> and get back Res(B), Res(A) which are in fact in order A->B.

Now I'm confused again. If you do do the above sequence, the first CQE
posted would be Res(A), and then Res(B)?

> Instead you could queue up PushBuffer(A), Send(Multishot),
> PushBuffer(B), and get back Res(Multishot), Res(Multishot)
> which are in order A -> B.

There should be no difference in ordering of the posted completion
between the two.
Pavel Begunkov Feb. 26, 2024, 2:36 p.m. UTC | #5
On 2/26/24 14:27, Jens Axboe wrote:
> On 2/26/24 7:02 AM, Dylan Yudaken wrote:
>> On Mon, Feb 26, 2024 at 1:38 PM Jens Axboe <axboe@kernel.dk> wrote:
>>>
>>> On 2/26/24 3:47 AM, Dylan Yudaken wrote:
>>>> On Sun, Feb 25, 2024 at 12:46 AM Jens Axboe <axboe@kernel.dk> wrote:
>>>>>
>>>>> This works very much like the receive side, except for sends. The idea
>>>>> is that an application can fill outgoing buffers in a provided buffer
>>>>> group, and then arm a single send that will service them all. For now
>>>>> this variant just terminates when we are out of buffers to send, and
>>>>> hence the application needs to re-arm it if IORING_CQE_F_MORE isn't
>>>>> set, as per usual for multishot requests.
>>>>>
>>>>
>>>> This feels to me a lot like just using OP_SEND with MSG_WAITALL as
>>>> described, unless I'm missing something?
>>>
>>> How so? MSG_WAITALL is "send X amount of data, and if it's a short send,
>>> try again" where multishot is "send data from this buffer group, and
>>> keep sending data until it's empty". Hence it's the mirror of multishot
>>> on the receive side. Unless I'm misunderstanding you somehow, not sure
>>> it'd be smart to add special meaning to MSG_WAITALL with provided
>>> buffers.
>>>
>>
>> _If_ you have the data upfront these are very similar, and only differ
>> in that the multishot approach will give you more granular progress
>> updates. My point was that this might not be a valuable API to people
>> for only this use case.
> 
> Not sure I agree, it feels like attributing a different meaning to
> MSG_WAITALL if you use a provided buffer vs if you don't. And that to me
> would seem to be confusing. Particularly when we have multishot on the
> receive side, and this is identical, just for sends. Receives will keep
> receiving as long as there are buffers in the provided group to receive
> into, and sends will keep sending for the same condition. Either one
> will terminate if we run out of buffers.
> 
> If you make MSG_WAITALL be that for provided buffers + send, then that
> behaves differently than MSG_WAITALL with receive, and MSG_WAITALL with
> send _without_ provided buffers. I don't think overloading an existing
> flag for this purposes is a good idea, particularly when we already have
> the existing semantics for multishot on the receive side.

I'm actually with Dylan on that and wonder where the perf win
could come from. Let's assume TCP, sends are usually completed
in the same syscall, otherwise your pacing is just bad. Thrift,
for example, collects sends and packs into one multi iov request
during a loop iteration. If the req completes immediately then
the userspace just wouldn't have time to push more buffers by
definition (assuming single threading).

If you actually need to poll tx, you send a request and collect
data into iov in userspace in background. When the request
completes you send all that in batch. You can probably find
a niche example when batch=1 in this case, but I don't think
anyone would care.

The example doesn't use multi-iov, and also still has to
serialise requests, which naturally serialises buffer consumption
w/o provided bufs.

>> You do make a good point about MSG_WAITALL though - multishot send
>> doesn't really make sense to me without MSG_WAITALL semantics. I
>> cannot imagine a useful use case where the first buffer being
>> partially sent will still want the second buffer sent.
> 
> Right, and I need to tweak that. Maybe we require MSG_WAITALL, or we
> make it implied for multishot send. Currently the code doesn't deal with
> that.
> 
> Maybe if MSG_WAITALL isn't set and we get a short send we don't set
> CQE_F_MORE and we just stop. If it is set, then we go through the usual
> retry logic. That would make it identical to MSG_WAITALL send without
> multishot, which again is something I like in that we don't have
> different behaviors depending on which mode we are using.
> 
>>>> I actually could imagine it being useful for the previous patches' use
>>>> case of queuing up sends and keeping ordering,
>>>> and I think the API is more obvious (rather than the second CQE
>>>> sending the first CQE's data). So maybe it's worth only
>>>> keeping one approach?
>>>
>>> And here you totally lost me :-)
>>
>> I am suggesting here that you don't really need to support buffer
>> lists on send without multishot.
> 
> That is certainly true, but I also don't see a reason _not_ to support
> it. Again mostly because this is how receive and everything else works.
> The app is free to issue a single SQE for send without multishot, and
> pick the first buffer and send it.

Multishot sounds interesting, but I don't see it as very useful if
you terminate when there are no buffers. Otherwise, if it continues
to sit there, someone would have to wake it up.

>> It's a slightly confusing API (to me) that you queue PushBuffer(A),
>> Send(A), PushBuffer(B), Send(B)
>> and get back Res(B), Res(A) which are in fact in order A->B.
> 
> Now I'm confused again. If you do do the above sequence, the first CQE
> posted would be Res(A), and then Res(B)?
> 
>> Instead you could queue up PushBuffer(A), Send(Multishot),
>> PushBuffer(B), and get back Res(Multishot), Res(Multishot)
>> which are in order A -> B.
> 
> There should be no difference in ordering of the posted completion
> between the two.
>
Jens Axboe Feb. 26, 2024, 3:16 p.m. UTC | #6
On 2/26/24 7:36 AM, Pavel Begunkov wrote:
> On 2/26/24 14:27, Jens Axboe wrote:
>> On 2/26/24 7:02 AM, Dylan Yudaken wrote:
>>> On Mon, Feb 26, 2024 at 1:38 PM Jens Axboe <axboe@kernel.dk> wrote:
>>>>
>>>> On 2/26/24 3:47 AM, Dylan Yudaken wrote:
>>>>> On Sun, Feb 25, 2024 at 12:46 AM Jens Axboe <axboe@kernel.dk> wrote:
>>>>>>
>>>>>> This works very much like the receive side, except for sends. The idea
>>>>>> is that an application can fill outgoing buffers in a provided buffer
>>>>>> group, and then arm a single send that will service them all. For now
>>>>>> this variant just terminates when we are out of buffers to send, and
>>>>>> hence the application needs to re-arm it if IORING_CQE_F_MORE isn't
>>>>>> set, as per usual for multishot requests.
>>>>>>
>>>>>
>>>>> This feels to me a lot like just using OP_SEND with MSG_WAITALL as
>>>>> described, unless I'm missing something?
>>>>
>>>> How so? MSG_WAITALL is "send X amount of data, and if it's a short send,
>>>> try again" where multishot is "send data from this buffer group, and
>>>> keep sending data until it's empty". Hence it's the mirror of multishot
>>>> on the receive side. Unless I'm misunderstanding you somehow, not sure
>>>> it'd be smart to add special meaning to MSG_WAITALL with provided
>>>> buffers.
>>>>
>>>
>>> _If_ you have the data upfront these are very similar, and only differ
>>> in that the multishot approach will give you more granular progress
>>> updates. My point was that this might not be a valuable API to people
>>> for only this use case.
>>
>> Not sure I agree, it feels like attributing a different meaning to
>> MSG_WAITALL if you use a provided buffer vs if you don't. And that to me
>> would seem to be confusing. Particularly when we have multishot on the
>> receive side, and this is identical, just for sends. Receives will keep
>> receiving as long as there are buffers in the provided group to receive
>> into, and sends will keep sending for the same condition. Either one
>> will terminate if we run out of buffers.
>>
>> If you make MSG_WAITALL be that for provided buffers + send, then that
>> behaves differently than MSG_WAITALL with receive, and MSG_WAITALL with
>> send _without_ provided buffers. I don't think overloading an existing
>> flag for this purposes is a good idea, particularly when we already have
>> the existing semantics for multishot on the receive side.
> 
> I'm actually with Dylan on that and wonder where the perf win
> could come from. Let's assume TCP, sends are usually completed
> in the same syscall, otherwise your pacing is just bad. Thrift,
> for example, collects sends and packs into one multi iov request
> during a loop iteration. If the req completes immediately then
> the userspace just wouldn't have time to push more buffers by
> definition (assuming single threading).

The problem only occurs when they don't complete inline, and now you get
reordering. The application could of course attempt to do proper pacing
and see if it can avoid that condition. If not, it now needs to
serialize sends. Using provided buffers makes this very easy, as you
don't need to care about it at all, and it eliminates complexity in the
application dealing with this.

> If you actually need to poll tx, you send a request and collect
> data into iov in userspace in background. When the request
> completes you send all that in batch. You can probably find
> a niche example when batch=1 in this case, but I don't think
> anyone would care.
> 
> The example doesn't use multi-iov, and also still has to
> serialise requests, which naturally serialises buffer consumption
> w/o provided bufs.

IMHO there's no reason NOT to have both a send with provided buffers and
a multishot send. The alternative would be to have send-N, where you
pass in N. But I don't see much point to that over "just drain the whole
pending list". The obvious use case is definitely send multishot, but
what would the reasoning be to prohibit pacing by explicitly disallowing
only doing a single buffer (or a partial queue)? As mentioned earlier, I
like keeping the symmetry with the receive side for multishot, and not
make it any different unless there's a reason to.

>>> You do make a good point about MSG_WAITALL though - multishot send
>>> doesn't really make sense to me without MSG_WAITALL semantics. I
>>> cannot imagine a useful use case where the first buffer being
>>> partially sent will still want the second buffer sent.
>>
>> Right, and I need to tweak that. Maybe we require MSG_WAITALL, or we
>> make it implied for multishot send. Currently the code doesn't deal with
>> that.
>>
>> Maybe if MSG_WAITALL isn't set and we get a short send we don't set
>> CQE_F_MORE and we just stop. If it is set, then we go through the usual
>> retry logic. That would make it identical to MSG_WAITALL send without
>> multishot, which again is something I like in that we don't have
>> different behaviors depending on which mode we are using.
>>
>>>>> I actually could imagine it being useful for the previous patches' use
>>>>> case of queuing up sends and keeping ordering,
>>>>> and I think the API is more obvious (rather than the second CQE
>>>>> sending the first CQE's data). So maybe it's worth only
>>>>> keeping one approach?
>>>>
>>>> And here you totally lost me :-)
>>>
>>> I am suggesting here that you don't really need to support buffer
>>> lists on send without multishot.
>>
>> That is certainly true, but I also don't see a reason _not_ to support
>> it. Again mostly because this is how receive and everything else works.
>> The app is free to issue a single SQE for send without multishot, and
>> pick the first buffer and send it.
> 
> Multishot sound interesting, but I don't see it much useful if
> you terminate when there are no buffers. Otherwise, if it continues
> to sit in, someone would have to wake it up

I did think about the termination case, and the problem is that if there
are no buffers, you need it to wake when there are buffers. And at that
point you may as well just do another send, as you need the application
to trigger it. The alternative would be to invent a way to trigger that
wakeup, which would be send only and weird just because of that.

If you end up poll armed because the socket buffer is full, then it
certainly makes sense to stay armed as there's a natural trigger there
when that condition resolves.
Pavel Begunkov Feb. 26, 2024, 3:41 p.m. UTC | #7
On 2/26/24 15:16, Jens Axboe wrote:
> On 2/26/24 7:36 AM, Pavel Begunkov wrote:
>> On 2/26/24 14:27, Jens Axboe wrote:
>>> On 2/26/24 7:02 AM, Dylan Yudaken wrote:
>>>> On Mon, Feb 26, 2024 at 1:38 PM Jens Axboe <axboe@kernel.dk> wrote:
>>>>>
>>>>> On 2/26/24 3:47 AM, Dylan Yudaken wrote:
>>>>>> On Sun, Feb 25, 2024 at 12:46 AM Jens Axboe <axboe@kernel.dk> wrote:
>>>>>>>
>>>>>>> This works very much like the receive side, except for sends. The idea
>>>>>>> is that an application can fill outgoing buffers in a provided buffer
>>>>>>> group, and then arm a single send that will service them all. For now
>>>>>>> this variant just terminates when we are out of buffers to send, and
>>>>>>> hence the application needs to re-arm it if IORING_CQE_F_MORE isn't
>>>>>>> set, as per usual for multishot requests.
>>>>>>>
>>>>>>
>>>>>> This feels to me a lot like just using OP_SEND with MSG_WAITALL as
>>>>>> described, unless I'm missing something?
>>>>>
>>>>> How so? MSG_WAITALL is "send X amount of data, and if it's a short send,
>>>>> try again" where multishot is "send data from this buffer group, and
>>>>> keep sending data until it's empty". Hence it's the mirror of multishot
>>>>> on the receive side. Unless I'm misunderstanding you somehow, not sure
>>>>> it'd be smart to add special meaning to MSG_WAITALL with provided
>>>>> buffers.
>>>>>
>>>>
>>>> _If_ you have the data upfront these are very similar, and only differ
>>>> in that the multishot approach will give you more granular progress
>>>> updates. My point was that this might not be a valuable API to people
>>>> for only this use case.
>>>
>>> Not sure I agree, it feels like attributing a different meaning to
>>> MSG_WAITALL if you use a provided buffer vs if you don't. And that to me
>>> would seem to be confusing. Particularly when we have multishot on the
>>> receive side, and this is identical, just for sends. Receives will keep
>>> receiving as long as there are buffers in the provided group to receive
>>> into, and sends will keep sending for the same condition. Either one
>>> will terminate if we run out of buffers.
>>>
>>> If you make MSG_WAITALL be that for provided buffers + send, then that
>>> behaves differently than MSG_WAITALL with receive, and MSG_WAITALL with
>>> send _without_ provided buffers. I don't think overloading an existing
>>> flag for this purposes is a good idea, particularly when we already have
>>> the existing semantics for multishot on the receive side.
>>
>> I'm actually with Dylan on that and wonder where the perf win
>> could come from. Let's assume TCP, sends are usually completed
>> in the same syscall, otherwise your pacing is just bad. Thrift,
>> for example, collects sends and packs into one multi iov request
>> during a loop iteration. If the req completes immediately then
>> the userspace just wouldn't have time to push more buffers by
>> definition (assuming single threading).
> 
> The problem only occurs when they don't complete inline, and now you get
> reordering. The application could of course attempt to do proper pacing
> and see if it can avoid that condition. If not, it now needs to

Ok, I admit that there are more than enough valid cases where artificial
pacing is not an option, which is why I also laid out the polling case.
Let's also say that this limits potential perf wins to streaming and very
large transfers (like files), not "lots of relatively small
request-response" kinds of apps.

> serialize sends. Using provided buffers makes this very easy, as you
> don't need to care about it at all, and it eliminates complexity in the
> application dealing with this.

If I'm correct, the example also serialises sends(?). I don't
think it's that much simpler. You batch, you send. Same with this,
but you batch into a provided buffer and the send is conditional.

Another downside is that you need a provided queue per socket,
which sounds pretty expensive for apps with 100s if not 1000s of
sockets.

>> If you actually need to poll tx, you send a request and collect
>> data into iov in userspace in background. When the request
>> completes you send all that in batch. You can probably find
>> a niche example when batch=1 in this case, but I don't think
>> anyone would care.
>>
>> The example doesn't use multi-iov, and also still has to
>> serialise requests, which naturally serialises buffer consumption
>> w/o provided bufs.
> 
> IMHO there's no reason NOT to have both a send with provided buffers and
> a multishot send. The alternative would be to have send-N, where you
> pass in N. But I don't see much point to that over "just drain the whole
> pending list". The obvious use case is definitely send multishot, but

Not sure I follow, but in all the cases I was contemplating,
you send everything you have at the moment.

> what would the reasoning be to prohibit pacing by explicitly disallowing
> only doing a single buffer (or a partial queue)? As mentioned earlier, I
> like keeping the symmetry with the receive side for multishot, and not
> make it any different unless there's a reason to.

They are different: buffer content is kernel (rx) vs userspace (tx)
provided, and the provided queue / group is per socket vs shared, plus
the wake-ups for multishot as per below. It's not like it's a one-line
change, so IMHO it needs to bring some benefit.

>>>> You do make a good point about MSG_WAITALL though - multishot send
>>>> doesn't really make sense to me without MSG_WAITALL semantics. I
>>>> cannot imagine a useful use case where the first buffer being
>>>> partially sent will still want the second buffer sent.
>>>
>>> Right, and I need to tweak that. Maybe we require MSG_WAITALL, or we
>>> make it implied for multishot send. Currently the code doesn't deal with
>>> that.
>>>
>>> Maybe if MSG_WAITALL isn't set and we get a short send we don't set
>>> CQE_F_MORE and we just stop. If it is set, then we go through the usual
>>> retry logic. That would make it identical to MSG_WAITALL send without
>>> multishot, which again is something I like in that we don't have
>>> different behaviors depending on which mode we are using.
>>>
>>>>>> I actually could imagine it being useful for the previous patches' use
>>>>>> case of queuing up sends and keeping ordering,
>>>>>> and I think the API is more obvious (rather than the second CQE
>>>>>> sending the first CQE's data). So maybe it's worth only
>>>>>> keeping one approach?
>>>>>
>>>>> And here you totally lost me :-)
>>>>
>>>> I am suggesting here that you don't really need to support buffer
>>>> lists on send without multishot.
>>>
>>> That is certainly true, but I also don't see a reason _not_ to support
>>> it. Again mostly because this is how receive and everything else works.
>>> The app is free to issue a single SQE for send without multishot, and
>>> pick the first buffer and send it.
>>
>> Multishot sound interesting, but I don't see it much useful if
>> you terminate when there are no buffers. Otherwise, if it continues
>> to sit in, someone would have to wake it up
> 
> I did think about the termination case, and the problem is that if there
> are no buffers, you need it to wake when there are buffers. And at that
> point you may as well just do another send, as you need the application
> to trigger it. The alternative would be to invent a way to trigger that
> wakeup, which would be send only and weird just because of that.

Yeah, that's the point: wake-ups would be userspace driven, and how
to do it without heavy stuff like syscalls is not so clear.

> If you end up poll armed because the socket buffer is full, then it
> certainly makes sense to stay armed as there's a natural trigger there
> when that condition resolves.
Jens Axboe Feb. 26, 2024, 7:11 p.m. UTC | #8
On 2/26/24 8:41 AM, Pavel Begunkov wrote:
> On 2/26/24 15:16, Jens Axboe wrote:
>> On 2/26/24 7:36 AM, Pavel Begunkov wrote:
>>> On 2/26/24 14:27, Jens Axboe wrote:
>>>> On 2/26/24 7:02 AM, Dylan Yudaken wrote:
>>>>> On Mon, Feb 26, 2024 at 1:38 PM Jens Axboe <axboe@kernel.dk> wrote:
>>>>>>
>>>>>> On 2/26/24 3:47 AM, Dylan Yudaken wrote:
>>>>>>> On Sun, Feb 25, 2024 at 12:46 AM Jens Axboe <axboe@kernel.dk> wrote:
>>>>>>>>
>>>>>>>> This works very much like the receive side, except for sends. The idea
>>>>>>>> is that an application can fill outgoing buffers in a provided buffer
>>>>>>>> group, and then arm a single send that will service them all. For now
>>>>>>>> this variant just terminates when we are out of buffers to send, and
>>>>>>>> hence the application needs to re-arm it if IORING_CQE_F_MORE isn't
>>>>>>>> set, as per usual for multishot requests.
>>>>>>>>
>>>>>>>
>>>>>>> This feels to me a lot like just using OP_SEND with MSG_WAITALL as
>>>>>>> described, unless I'm missing something?
>>>>>>
>>>>>> How so? MSG_WAITALL is "send X amount of data, and if it's a short send,
>>>>>> try again" where multishot is "send data from this buffer group, and
>>>>>> keep sending data until it's empty". Hence it's the mirror of multishot
>>>>>> on the receive side. Unless I'm misunderstanding you somehow, not sure
>>>>>> it'd be smart to add special meaning to MSG_WAITALL with provided
>>>>>> buffers.
>>>>>>
>>>>>
>>>>> _If_ you have the data upfront these are very similar, and only differ
>>>>> in that the multishot approach will give you more granular progress
>>>>> updates. My point was that this might not be a valuable API to people
>>>>> for only this use case.
>>>>
>>>> Not sure I agree, it feels like attributing a different meaning to
>>>> MSG_WAITALL if you use a provided buffer vs if you don't. And that to me
>>>> would seem to be confusing. Particularly when we have multishot on the
>>>> receive side, and this is identical, just for sends. Receives will keep
>>>> receiving as long as there are buffers in the provided group to receive
>>>> into, and sends will keep sending for the same condition. Either one
>>>> will terminate if we run out of buffers.
>>>>
>>>> If you make MSG_WAITALL be that for provided buffers + send, then that
>>>> behaves differently than MSG_WAITALL with receive, and MSG_WAITALL with
>>>> send _without_ provided buffers. I don't think overloading an existing
>>>> flag for this purposes is a good idea, particularly when we already have
>>>> the existing semantics for multishot on the receive side.
>>>
>>> I'm actually with Dylan on that and wonder where the perf win
>>> could come from. Let's assume TCP, sends are usually completed
>>> in the same syscall, otherwise your pacing is just bad. Thrift,
>>> for example, collects sends and packs into one multi iov request
>>> during a loop iteration. If the req completes immediately then
>>> the userspace just wouldn't have time to push more buffers by
>>> definition (assuming single threading).
>>
>> The problem only occurs when they don't complete inline, and now you get
>> reordering. The application could of course attempt to do proper pacing
>> and see if it can avoid that condition. If not, it now needs to
> 
> Ok, I admit that there are more than valid cases when artificial pacing
> is not an option, which is why I also laid out the polling case.
> Let's also say that limits potential perf wins to streaming and very
> large transfers (like files), not "lots of relatively small
> request-response" kinds of apps.

I don't think that's true - if you're doing large streaming, you're more
likely to keep the socket buffer full, whereas for smallish sends, it's
less likely to be full. Testing with the silly proxy confirms that. And
outside of cases where pacing just isn't feasible, it's extra overhead
for cases where you potentially could or what. To me, the main appeal of
this is the simplicity.

>> serialize sends. Using provided buffers makes this very easy, as you
>> don't need to care about it at all, and it eliminates complexity in the
>> application dealing with this.
> 
> If I'm correct the example also serialises sends(?). I don't
> think it's that simpler. You batch, you send. Same with this,
> but batch into a provided buffer and the send is conditional.

Do you mean the proxy example? Just want to be sure we're talking about
the same thing. Yes it has to serialize sends, because otherwise we can
run into the condition described in the patch that adds provided buffer
support for send. But I did bench multishot separately from there,
here's some of it:

10G network, 3 hosts, 1 acting as a mirror proxy shuffling N-byte packets.
Send ring and send multishot not used:

Pkt sz | Send ring | mshot |  usec  |  QPS  |  Bw
=====================================================
1000   |    No	   |  No   |   437  | 1.22M | 9598M
32     |    No	   |  No   |  5856  | 2.87M |  734M

Same test, now turn on send ring:

Pkt sz | Send ring | mshot |  usec  |  QPS  |  Bw   | Diff
===========================================================
1000   |    Yes	   |  No   |   436  | 1.23M | 9620M | + 0.2%
32     |    Yes	   |  No   |  3462  | 4.85M | 1237M | +68.5%

Same test, now turn on send mshot as well:

Pkt sz | Send ring | mshot |  usec  |  QPS  |  Bw   | Diff
===========================================================
1000   |    Yes	   |  Yes  |   436  | 1.23M | 9620M | + 0.2%
32     |    Yes	   |  Yes  |  3125  | 5.37M | 1374M | +87.2%

which does show that there's another win on top for just queueing these
sends and doing a single send to handle them, rather than needing to
prepare a send for each buffer. Part of that may be that you simply run
out of SQEs and then have to submit regardless of where you are at.

> Another downside is that you need a provided queue per socket,
> which sounds pretty expensive for 100s if not 1000s socket
> apps.

That's certainly true. But either you need backlog per socket anyway in
the app, or you only send single buffers anyway (in a typical
request/response kind of fashion) between receives and you don't need it
at all.

>>> If you actually need to poll tx, you send a request and collect
>>> data into iov in userspace in background. When the request
>>> completes you send all that in batch. You can probably find
>>> a niche example when batch=1 in this case, but I don't think
>>> anyone would care.
>>>
>>> The example doesn't use multi-iov, and also still has to
>>> serialise requests, which naturally serialises buffer consumption
>>> w/o provided bufs.
>>
>> IMHO there's no reason NOT to have both a send with provided buffers and
>> a multishot send. The alternative would be to have send-N, where you
>> pass in N. But I don't see much point to that over "just drain the whole
>> pending list". The obvious use case is definitely send multishot, but
> 
> Not sure I follow, but in all cases I was contemplating,
> you send everything you have at the moment.
> 
>> what would the reasoning be to prohibit pacing by explicitly disallowing
>> only doing a single buffer (or a partial queue)? As mentioned earlier, I
>> like keeping the symmetry with the receive side for multishot, and not
>> make it any different unless there's a reason to.
> 
> They are different: buffer content is kernel (rx) vs userspace (tx)
> provided, and the provided queue / group is per socket vs shared. Wake ups
> for multishots as per below. It's not like it's a one line change,
> so IMHO it needs to bring some benefits.

Are you talking about provided buffers, or multishot specifically? I
think both are standalone pretty much as simple as they can be. And if
the argument is "just have send with provided buffers be multishot by
default", then that single patch is basically the two patches combined.
There's no simplification there. Outside of a strong argument for why it
would never make sense to do single shot send with provided buffers, I
really don't want to combine them into one single action.

>>>>> You do make a good point about MSG_WAITALL though - multishot send
>>>>> doesn't really make sense to me without MSG_WAITALL semantics. I
>>>>> cannot imagine a useful use case where the first buffer being
>>>>> partially sent will still want the second buffer sent.
>>>>
>>>> Right, and I need to tweak that. Maybe we require MSG_WAITALL, or we
>>>> make it implied for multishot send. Currently the code doesn't deal with
>>>> that.
>>>>
>>>> Maybe if MSG_WAITALL isn't set and we get a short send we don't set
>>>> CQE_F_MORE and we just stop. If it is set, then we go through the usual
>>>> retry logic. That would make it identical to MSG_WAITALL send without
>>>> multishot, which again is something I like in that we don't have
>>>> different behaviors depending on which mode we are using.
>>>>
>>>>>>> I actually could imagine it being useful for the previous patches' use
>>>>>>> case of queuing up sends and keeping ordering,
>>>>>>> and I think the API is more obvious (rather than the second CQE
>>>>>>> sending the first CQE's data). So maybe it's worth only
>>>>>>> keeping one approach?
>>>>>>
>>>>>> And here you totally lost me :-)
>>>>>
>>>>> I am suggesting here that you don't really need to support buffer
>>>>> lists on send without multishot.
>>>>
>>>> That is certainly true, but I also don't see a reason _not_ to support
>>>> it. Again mostly because this is how receive and everything else works.
>>>> The app is free to issue a single SQE for send without multishot, and
>>>> pick the first buffer and send it.
>>>
>>> Multishot sound interesting, but I don't see it much useful if
>>> you terminate when there are no buffers. Otherwise, if it continues
>>> to sit in, someone would have to wake it up
>>
>> I did think about the termination case, and the problem is that if there
>> are no buffers, you need it to wake when there are buffers. And at that
>> point you may as well just do another send, as you need the application
>> to trigger it. The alternative would be to invent a way to trigger that
>> wakeup, which would be send only and weird just because of that.
> 
> Yeah, that's the point, wake ups would be userspace driven, and how
> to do it without heavy stuff like syscalls is not so clear.

It's just not possible without e.g. polling, either directly or using some
monitor/mwait arch specific thing which would be awful. Or by doing some
manual wakeup, which would need to lookup and kick the request, which I
bet would be worse than just re-arming the send multishot.

If you could poll trigger it somehow, it also further complicates things
as now it could potentially happen at any time. As it stands, the app
knows when a poll multishot is armed (and submitted, or not), and can
serialize with the outgoing buffer queue trivially.
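The arm/re-arm contract being discussed can be sketched as a small model in plain C. This is illustrative only: `CQE_F_MORE` stands in for `IORING_CQE_F_MORE`, and `multishot_send()` models the kernel-side draining behavior described in the thread, not any real io_uring code.

```c
#include <assert.h>
#include <stddef.h>

#define CQE_F_MORE 0x1  /* stand-in for IORING_CQE_F_MORE */

struct cqe {
    size_t sent;        /* bytes sent for this buffer */
    unsigned flags;     /* completion flags */
};

/*
 * Model of the multishot send loop: drain the provided buffer group,
 * posting one CQE per buffer.  Every CQE but the last carries
 * CQE_F_MORE; once the group is empty the final CQE clears it, which
 * is the application's signal to re-arm the request.
 */
static size_t multishot_send(const size_t *buf_lens, size_t nbufs,
                             struct cqe *cqes)
{
    size_t i;

    for (i = 0; i < nbufs; i++) {
        cqes[i].sent = buf_lens[i];
        cqes[i].flags = (i + 1 < nbufs) ? CQE_F_MORE : 0;
    }
    return i;
}
```

An application loop then waits for CQEs and re-arms only when a completion arrives with the flag clear, mirroring multishot receive.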
Pavel Begunkov Feb. 26, 2024, 7:21 p.m. UTC | #9
On 2/26/24 19:11, Jens Axboe wrote:
> On 2/26/24 8:41 AM, Pavel Begunkov wrote:
>> On 2/26/24 15:16, Jens Axboe wrote:
>>> On 2/26/24 7:36 AM, Pavel Begunkov wrote:
>>>> On 2/26/24 14:27, Jens Axboe wrote:
>>>>> On 2/26/24 7:02 AM, Dylan Yudaken wrote:
>>>>>> On Mon, Feb 26, 2024 at 1:38 PM Jens Axboe <axboe@kernel.dk> wrote:
>>>>>>>
>>>>>>> On 2/26/24 3:47 AM, Dylan Yudaken wrote:
>>>>>>>> On Sun, Feb 25, 2024 at 12:46 AM Jens Axboe <axboe@kernel.dk> wrote:
>>>>>>>>>
>>>>>>>>> This works very much like the receive side, except for sends. The idea
>>>>>>>>> is that an application can fill outgoing buffers in a provided buffer
>>>>>>>>> group, and then arm a single send that will service them all. For now
>>>>>>>>> this variant just terminates when we are out of buffers to send, and
>>>>>>>>> hence the application needs to re-arm it if IORING_CQE_F_MORE isn't
>>>>>>>>> set, as per usual for multishot requests.
>>>>>>>>>
>>>>>>>>
>>>>>>>> This feels to me a lot like just using OP_SEND with MSG_WAITALL as
>>>>>>>> described, unless I'm missing something?
>>>>>>>
>>>>>>> How so? MSG_WAITALL is "send X amount of data, and if it's a short send,
>>>>>>> try again" where multishot is "send data from this buffer group, and
>>>>>>> keep sending data until it's empty". Hence it's the mirror of multishot
>>>>>>> on the receive side. Unless I'm misunderstanding you somehow, not sure
>>>>>>> it'd be smart to add special meaning to MSG_WAITALL with provided
>>>>>>> buffers.
>>>>>>>
>>>>>>
>>>>>> _If_ you have the data upfront these are very similar, and only differ
>>>>>> in that the multishot approach will give you more granular progress
>>>>>> updates. My point was that this might not be a valuable API to people
>>>>>> for only this use case.
>>>>>
>>>>> Not sure I agree, it feels like attributing a different meaning to
>>>>> MSG_WAITALL if you use a provided buffer vs if you don't. And that to me
>>>>> would seem to be confusing. Particularly when we have multishot on the
>>>>> receive side, and this is identical, just for sends. Receives will keep
>>>>> receiving as long as there are buffers in the provided group to receive
>>>>> into, and sends will keep sending for the same condition. Either one
>>>>> will terminate if we run out of buffers.
>>>>>
>>>>> If you make MSG_WAITALL be that for provided buffers + send, then that
>>>>> behaves differently than MSG_WAITALL with receive, and MSG_WAITALL with
>>>>> send _without_ provided buffers. I don't think overloading an existing
>>>>> flag for this purposes is a good idea, particularly when we already have
>>>>> the existing semantics for multishot on the receive side.
>>>>
>>>> I'm actually with Dylan on that and wonder where the perf win
>>>> could come from. Let's assume TCP, sends are usually completed
>>>> in the same syscall, otherwise your pacing is just bad. Thrift,
>>>> for example, collects sends and packs into one multi iov request
>>>> during a loop iteration. If the req completes immediately then
>>>> the userspace just wouldn't have time to push more buffers by
>>>> definition (assuming single threading).
>>>
>>> The problem only occurs when they don't complete inline, and now you get
>>> reordering. The application could of course attempt to do proper pacing
>>> and see if it can avoid that condition. If not, it now needs to
>>
>> Ok, I admit that there are more than valid cases when artificial pacing
>> is not an option, which is why I also laid out the polling case.
>> Let's also say that limits potential perf wins to streaming and very
>> large transfers (like files), not "lots of relatively small
>> request-response" kinds of apps.
> 
> I don't think that's true - if you're doing large streaming, you're more
> likely to keep the socket buffer full, whereas for smallish sends, it's
> less likely to be full. Testing with the silly proxy confirms that. And

I don't see any contradiction to what I said. With streaming/large
sends it's more likely to be polled. For small sends and
send-receive-send-... patterns the sock queue is unlikely to be
full, in which case the send is processed inline, and so the
feature doesn't add performance, as you agreed a couple of emails
before.

> outside of cases where pacing just isn't feasible, it's extra overhead
> for cases where you potentially could or what.

I lost it, what overhead?

> To me, the main appeal of this is the simplicity.

I'd argue it doesn't seem any simpler than the alternative.

>>> serialize sends. Using provided buffers makes this very easy, as you
>>> don't need to care about it at all, and it eliminates complexity in the
>>> application dealing with this.
>>
>> If I'm correct the example also serialises sends(?). I don't
>> think it's that simpler. You batch, you send. Same with this,
>> but batch into a provided buffer and the send is conditional.
> 
> Do you mean the proxy example? Just want to be sure we're talking about

Yes, proxy, the one you referenced in the CV. And FWIW, I don't
think it's a fair comparison without batching followed by multi-iov.

> the same thing. Yes it has to serialize sends, because otherwise we can
> run into the condition described in the patch that adds provided buffer
> support for send. But I did bench multishot separately from there,
> here's some of it:
> 
> 10G network, 3 hosts, 1 acting as a mirror proxy shuffling N-byte packets.
> Send ring and send multishot not used:
> 
> Pkt sz | Send ring | mshot |  usec  |  QPS  |  Bw
> =====================================================
> 1000   |    No	   |  No   |   437  | 1.22M | 9598M
> 32     |    No	   |  No   |  5856  | 2.87M |  734M
> 
> Same test, now turn on send ring:
> 
> Pkt sz | Send ring | mshot |  usec  |  QPS  |  Bw   | Diff
> ===========================================================
> 1000   |    Yes	   |  No   |   436  | 1.23M | 9620M | + 0.2%
> 32     |    Yes	   |  No   |  3462  | 4.85M | 1237M | +68.5%
> 
> Same test, now turn on send mshot as well:
> 
> Pkt sz | Send ring | mshot |  usec  |  QPS  |  Bw   | Diff
> ===========================================================
> 1000   |    Yes	   |  Yes  |   436  | 1.23M | 9620M | + 0.2%
> 32     |    Yes	   |  Yes  |  3125  | 5.37M | 1374M | +87.2%
> 
> which does show that there's another win on top for just queueing these
> sends and doing a single send to handle them, rather than needing to
> prepare a send for each buffer. Part of that may be that you simply run
> out of SQEs and then have to submit regardless of where you are at.

How many sockets did you test with? It's 1 SQE per sock max

+87% sounds like a huge difference, and I don't understand where
it comes from, hence the question

>> Another downside is that you need a provided queue per socket,
>> which sounds pretty expensive for 100s if not 1000s socket
>> apps.
> 
> That's certainly true. But either you need backlog per socket anyway in
> the app, or you only send single buffers anyway (in a typical
> request/response kind of fashion) between receives and you don't need it
> at all.

That's pinning pages and mapping them, which surely is not bad
but with everything else equal malloc()/stack alloc is much
nicer in terms of resources. (Not talking about CPU setup
overhead).

>>>> If you actually need to poll tx, you send a request and collect
>>>> data into iov in userspace in background. When the request
>>>> completes you send all that in batch. You can probably find
>>>> a niche example when batch=1 in this case, but I don't think
>>>> anyone would care.
>>>>
>>>> The example doesn't use multi-iov, and also still has to
>>>> serialise requests, which naturally serialises buffer consumption
>>>> w/o provided bufs.
>>>
>>> IMHO there's no reason NOT to have both a send with provided buffers and
>>> a multishot send. The alternative would be to have send-N, where you
>>> pass in N. But I don't see much point to that over "just drain the whole
>>> pending list". The obvious use case is definitely send multishot, but
>>
>> Not sure I follow, but in all cases I was contemplating,
>> you send everything you have at the moment.
>>
>>> what would the reasoning be to prohibit pacing by explicitly disallowing
>>> only doing a single buffer (or a partial queue)? As mentioned earlier, I
>>> like keeping the symmetry with the receive side for multishot, and not
>>> make it any different unless there's a reason to.
>>
>> They are different: buffer content is kernel (rx) vs userspace (tx)
>> provided, and the provided queue / group is per socket vs shared. Wake ups
>> for multishots as per below. It's not like it's a one line change,
>> so IMHO it needs to bring some benefits.
> 
> Are you talking about provided buffers, or multishot specifically? I

I assumed that any of them would retry until the queue is exhausted,
at least that sounds more efficient and is what all the comments assume.

> think both are standalone pretty much as simple as they can be. And if
> the argument is "just have send with provided buffers be multishot by
> default",

It's not, rx and tx are different, e.g. true tx multishot doesn't
seem to be possible because of that.

> then that single patch is basically the two patches combined.
> There's no simplification there. Outside of a strong argument for why it
> would never make sense to do single shot send with provided buffers, I
> really don't want to combine them into one single action.

In the current form it does make more sense to have
multishot as an option.

>>>>>> You do make a good point about MSG_WAITALL though - multishot send
>>>>>> doesn't really make sense to me without MSG_WAITALL semantics. I
>>>>>> cannot imagine a useful use case where the first buffer being
>>>>>> partially sent will still want the second buffer sent.
>>>>>
>>>>> Right, and I need to tweak that. Maybe we require MSG_WAITALL, or we
>>>>> make it implied for multishot send. Currently the code doesn't deal with
>>>>> that.
>>>>>
>>>>> Maybe if MSG_WAITALL isn't set and we get a short send we don't set
>>>>> CQE_F_MORE and we just stop. If it is set, then we go through the usual
>>>>> retry logic. That would make it identical to MSG_WAITALL send without
>>>>> multishot, which again is something I like in that we don't have
>>>>> different behaviors depending on which mode we are using.
>>>>>
>>>>>>>> I actually could imagine it being useful for the previous patches' use
>>>>>>>> case of queuing up sends and keeping ordering,
>>>>>>>> and I think the API is more obvious (rather than the second CQE
>>>>>>>> sending the first CQE's data). So maybe it's worth only
>>>>>>>> keeping one approach?
>>>>>>>
>>>>>>> And here you totally lost me :-)
>>>>>>
>>>>>> I am suggesting here that you don't really need to support buffer
>>>>>> lists on send without multishot.
>>>>>
>>>>> That is certainly true, but I also don't see a reason _not_ to support
>>>>> it. Again mostly because this is how receive and everything else works.
>>>>> The app is free to issue a single SQE for send without multishot, and
>>>>> pick the first buffer and send it.
>>>>
>>>> Multishot sound interesting, but I don't see it much useful if
>>>> you terminate when there are no buffers. Otherwise, if it continues
>>>> to sit in, someone would have to wake it up
>>>
>>> I did think about the termination case, and the problem is that if there
>>> are no buffers, you need it to wake when there are buffers. And at that
>>> point you may as well just do another send, as you need the application
>>> to trigger it. The alternative would be to invent a way to trigger that
>>> wakeup, which would be send only and weird just because of that.
>>
>> Yeah, that's the point, wake ups would be userspace driven, and how
>> to do it without heavy stuff like syscalls is not so clear.
> 
> It's just not possible without e.g. polling, either directly or using some
> monitor/mwait arch specific thing which would be awful. Or by doing some
> manual wakeup, which would need to lookup and kick the request, which I
> bet would be worse than just re-arming the send multishot.

Right

> If you could poll trigger it somehow, it also further complicates things
> as now it could potentially happen at any time. As it stands, the app
> knows when a poll multishot is armed (and submitted, or not), and can
> serialize with the outgoing buffer queue trivially.
Dylan Yudaken Feb. 26, 2024, 7:31 p.m. UTC | #10
On Mon, Feb 26, 2024 at 2:27 PM Jens Axboe <axboe@kernel.dk> wrote:
>
> > You do make a good point about MSG_WAITALL though - multishot send
> > doesn't really make sense to me without MSG_WAITALL semantics. I
> > cannot imagine a useful use case where the first buffer being
> > partially sent will still want the second buffer sent.
>
> Right, and I need to tweak that. Maybe we require MSG_WAITALL, or we
> make it implied for multishot send. Currently the code doesn't deal with
> that.
>
> Maybe if MSG_WAITALL isn't set and we get a short send we don't set
> CQE_F_MORE and we just stop. If it is set, then we go through the usual
> retry logic. That would make it identical to MSG_WAITALL send without
> multishot, which again is something I like in that we don't have
> different behaviors depending on which mode we are using.
>

It sounds like the right approach and is reasonably obvious. (I see
this is in v4 already)

Dylan
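The short-send rule settled on above (and, per Dylan, present in v4) can be modeled as follows. This is a sketch, not kernel code: `sock_space` stands for how much the socket buffer will accept, and the flag values are illustrative stand-ins, not the real uapi constants.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MSG_WAITALL_F 0x1  /* stand-in for MSG_WAITALL */
#define CQE_F_MORE    0x2  /* stand-in for IORING_CQE_F_MORE */

struct send_res {
    size_t sent;
    unsigned cqe_flags;
};

/*
 * Model of the short-send rule: without MSG_WAITALL, a short send
 * completes the request with CQE_F_MORE cleared (multishot stops and
 * the app re-arms); with MSG_WAITALL the kernel retries until the
 * buffer is fully sent, matching non-multishot MSG_WAITALL behavior.
 */
static struct send_res send_buf(size_t len, size_t sock_space,
                                unsigned msg_flags, bool more_bufs)
{
    struct send_res r;

    if (sock_space < len && !(msg_flags & MSG_WAITALL_F)) {
        r.sent = sock_space;    /* short send: terminate multishot */
        r.cqe_flags = 0;
        return r;
    }
    r.sent = len;               /* full send, possibly via retries */
    r.cqe_flags = more_bufs ? CQE_F_MORE : 0;
    return r;
}
```

Either way the behavior is identical to the corresponding singleshot send, which is the symmetry argued for in the thread.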
Jens Axboe Feb. 26, 2024, 7:49 p.m. UTC | #11
On 2/26/24 12:31 PM, Dylan Yudaken wrote:
> On Mon, Feb 26, 2024 at 2:27 PM Jens Axboe <axboe@kernel.dk> wrote:
>>
>>> You do make a good point about MSG_WAITALL though - multishot send
>>> doesn't really make sense to me without MSG_WAITALL semantics. I
>>> cannot imagine a useful use case where the first buffer being
>>> partially sent will still want the second buffer sent.
>>
>> Right, and I need to tweak that. Maybe we require MSG_WAITALL, or we
>> make it implied for multishot send. Currently the code doesn't deal with
>> that.
>>
>> Maybe if MSG_WAITALL isn't set and we get a short send we don't set
>> CQE_F_MORE and we just stop. If it is set, then we go through the usual
>> retry logic. That would make it identical to MSG_WAITALL send without
>> multishot, which again is something I like in that we don't have
>> different behaviors depending on which mode we are using.
>>
> 
> It sounds like the right approach and is reasonably obvious. (I see
> this is in v4 already)

Yep, thanks for bringing attention to it! I wrote it up in the man pages
as well. At least to me, it's what I would expect to be the case.
Jens Axboe Feb. 26, 2024, 8:12 p.m. UTC | #12
On 2/26/24 12:21 PM, Pavel Begunkov wrote:
> On 2/26/24 19:11, Jens Axboe wrote:
>> On 2/26/24 8:41 AM, Pavel Begunkov wrote:
>>> On 2/26/24 15:16, Jens Axboe wrote:
>>>> On 2/26/24 7:36 AM, Pavel Begunkov wrote:
>>>>> On 2/26/24 14:27, Jens Axboe wrote:
>>>>>> On 2/26/24 7:02 AM, Dylan Yudaken wrote:
>>>>>>> On Mon, Feb 26, 2024 at 1:38 PM Jens Axboe
>>>>>>> <axboe@kernel.dk> wrote:
>>>>>>>> 
>>>>>>>> On 2/26/24 3:47 AM, Dylan Yudaken wrote:
>>>>>>>>> On Sun, Feb 25, 2024 at 12:46 AM Jens Axboe
>>>>>>>>> <axboe@kernel.dk> wrote:
>>>>>>>>>> 
>>>>>>>>>> This works very much like the receive side, except
>>>>>>>>>> for sends. The idea is that an application can fill
>>>>>>>>>> outgoing buffers in a provided buffer group, and
>>>>>>>>>> then arm a single send that will service them all.
>>>>>>>>>> For now this variant just terminates when we are
>>>>>>>>>> out of buffers to send, and hence the application
>>>>>>>>>> needs to re-arm it if IORING_CQE_F_MORE isn't set,
>>>>>>>>>> as per usual for multishot requests.
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> This feels to me a lot like just using OP_SEND with
>>>>>>>>> MSG_WAITALL as described, unless I'm missing
>>>>>>>>> something?
>>>>>>>> 
>>>>>>>> How so? MSG_WAITALL is "send X amount of data, and if
>>>>>>>> it's a short send, try again" where multishot is "send
>>>>>>>> data from this buffer group, and keep sending data
>>>>>>>> until it's empty". Hence it's the mirror of multishot 
>>>>>>>> on the receive side. Unless I'm misunderstanding you
>>>>>>>> somehow, not sure it'd be smart to add special meaning
>>>>>>>> to MSG_WAITALL with provided buffers.
>>>>>>>> 
>>>>>>> 
>>>>>>> _If_ you have the data upfront these are very similar,
>>>>>>> and only differ in that the multishot approach will give
>>>>>>> you more granular progress updates. My point was that
>>>>>>> this might not be a valuable API to people for only this
>>>>>>> use case.
>>>>>> 
>>>>>> Not sure I agree, it feels like attributing a different
>>>>>> meaning to MSG_WAITALL if you use a provided buffer vs if
>>>>>> you don't. And that to me would seem to be confusing.
>>>>>> Particularly when we have multishot on the receive side,
>>>>>> and this is identical, just for sends. Receives will keep 
>>>>>> receiving as long as there are buffers in the provided
>>>>>> group to receive into, and sends will keep sending for the
>>>>>> same condition. Either one will terminate if we run out of
>>>>>> buffers.
>>>>>> 
>>>>>> If you make MSG_WAITALL be that for provided buffers +
>>>>>> send, then that behaves differently than MSG_WAITALL with
>>>>>> receive, and MSG_WAITALL with send _without_ provided
>>>>>> buffers. I don't think overloading an existing flag for
>>>>>> this purposes is a good idea, particularly when we already
>>>>>> have the existing semantics for multishot on the receive
>>>>>> side.
>>>>> 
>>>>> I'm actually with Dylan on that and wonder where the perf
>>>>> win could come from. Let's assume TCP, sends are usually
>>>>> completed in the same syscall, otherwise your pacing is just
>>>>> bad. Thrift, for example, collects sends and packs into one
>>>>> multi iov request during a loop iteration. If the req
>>>>> completes immediately then the userspace just wouldn't have
>>>>> time to push more buffers by definition (assuming single
>>>>> threading).
>>>> 
>>>> The problem only occurs when they don't complete inline, and
>>>> now you get reordering. The application could of course attempt
>>>> to do proper pacing and see if it can avoid that condition. If
>>>> not, it now needs to
>>> 
>>> Ok, I admit that there are more than valid cases when artificial
>>> pacing is not an option, which is why I also laid out the polling
>>> case. Let's also say that limits potential perf wins to streaming
>>> and very large transfers (like files), not "lots of relatively
>>> small request-response" kinds of apps.
>> 
>> I don't think that's true - if you're doing large streaming, you're
>> more likely to keep the socket buffer full, whereas for smallish
>> sends, it's less likely to be full. Testing with the silly proxy
>> confirms that. And
> 
> I don't see any contradiction to what I said. With streaming/large 
> sends it's more likely to be polled. For small sends and 
> send-receive-send-... patterns the sock queue is unlikely to be full,
> in which case the send is processed inline, and so the feature
> doesn't add performance, as you agreed a couple of emails before.

Gotcha, I guess I misread you, we agree that the poll side is more
likely on bigger buffers.

>> outside of cases where pacing just isn't feasible, it's extra
>> overhead for cases where you potentially could or what.
> 
> I lost it, what overhead?

Overhead of needing to serialize the sends in the application, which may
include both extra memory needed and overhead in dealing with it.

>> To me, the main appeal of this is the simplicity.
> 
> I'd argue it doesn't seem any simpler than the alternative.

It's certainly simpler for an application to do "add buffer to queue"
and not need to worry about managing sends, than it is to manage a
backlog of only having a single send active.

>>>> serialize sends. Using provided buffers makes this very easy,
>>>> as you don't need to care about it at all, and it eliminates
>>>> complexity in the application dealing with this.
>>> 
>>> If I'm correct the example also serialises sends(?). I don't 
>>> think it's that simpler. You batch, you send. Same with this, but
>>> batch into a provided buffer and the send is conditional.
>> 
>> Do you mean the proxy example? Just want to be sure we're talking
>> about
> 
> Yes, proxy, the one you referenced in the CV. And FWIW, I don't think
> it's a fair comparison without batching followed by multi-iov.

It's not about vectored vs non-vectored IO, though you could of course
need to allocate an arbitrarily sized iovec that you can append to. And
now you need to use sendmsg rather than just send, which has further
overhead on top of send.

What kind of batching? The batching done by the tests is the same,
regardless of whether or not send multishot is used in the sense that we
wait on the same number of completions. As it's a basic proxy kind of
thing, it'll receive a packet and send a packet. Submission batching is
the same too, we'll submit when we have to.

>> the same thing. Yes it has to serialize sends, because otherwise we
>> can run into the condition described in the patch that adds
>> provided buffer support for send. But I did bench multishot
>> separately from there, here's some of it:
>> 
>> 10G network, 3 hosts, 1 acting as a mirror proxy shuffling N-byte
>> packets. Send ring and send multishot not used:
>> 
>> Pkt sz | Send ring | mshot |  usec  |  QPS  |  Bw
>> =====================================================
>> 1000   |    No     |  No   |   437  | 1.22M | 9598M
>> 32     |    No     |  No   |  5856  | 2.87M |  734M
>> 
>> Same test, now turn on send ring:
>> 
>> Pkt sz | Send ring | mshot |  usec  |  QPS  |  Bw   | Diff
>> ===========================================================
>> 1000   |    Yes    |  No   |   436  | 1.23M | 9620M | + 0.2%
>> 32     |    Yes    |  No   |  3462  | 4.85M | 1237M | +68.5%
>> 
>> Same test, now turn on send mshot as well:
>> 
>> Pkt sz | Send ring | mshot |  usec  |  QPS  |  Bw   | Diff
>> ===========================================================
>> 1000   |    Yes    |  Yes  |   436  | 1.23M | 9620M | + 0.2%
>> 32     |    Yes    |  Yes  |  3125  | 5.37M | 1374M | +87.2%
>> 
>> which does show that there's another win on top for just queueing
>> these sends and doing a single send to handle them, rather than
>> needing to prepare a send for each buffer. Part of that may be that
>> you simply run out of SQEs and then have to submit regardless of
>> where you are at.
> 
> How many sockets did you test with? It's 1 SQE per sock max

The above is just one, but I've run it with a lot more sockets. Nothing
like thousands, but 64-128.

> +87% sounds like a huge difference, and I don't understand where it
> comes from, hence the question

There are several things:

1) Fact is that the app has to serialize sends for the unlikely case
   of sends being reordered because of the condition outlined in the
   patch that enables provided buffer support for send. This is the
   largest win, particularly with smaller packets, as it ruins the
   send pipeline.

2) We're posting fewer SQEs. That's the multishot win. Obviously not
   as large, but it does help.

People have asked in the past on how to serialize sends, and I've had to
tell them that it isn't really possible. The only option we had was
using drain or links, which aren't ideal nor very flexible. Using
provided buffers finally gives the application a way to do that without
needing to do anything really. Does every application need it? Certainly
not, but for the ones that do, I do think it provides a great
alternative that's better performing than doing single sends one at a
time.
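To make the contrast concrete, here is a minimal sketch (plain C, hypothetical names, no liburing calls) of the per-socket backlog an application otherwise has to maintain to serialize sends: only the head of the queue is ever in flight, and each completion releases the next buffer.

```c
#include <assert.h>
#include <stdlib.h>

/* Illustrative sketch of a userspace-managed send backlog: at most one
 * send is in flight per connection, so completions can never reorder
 * the byte stream. Names (struct pending, conn_queue_send, ...) are
 * made up for this example. */

struct pending {
	void *data;
	int len;
	struct pending *next;
};

struct conn {
	struct pending *head, *tail;	/* FIFO of queued buffers */
	int send_inflight;		/* at most one armed send */
};

/* Queue a buffer; returns 1 if the caller should submit a send now,
 * 0 if it must wait behind the in-flight send. */
static int conn_queue_send(struct conn *c, void *data, int len)
{
	struct pending *p = malloc(sizeof(*p));

	p->data = data;
	p->len = len;
	p->next = NULL;
	if (c->tail)
		c->tail->next = p;
	else
		c->head = p;
	c->tail = p;
	if (c->send_inflight)
		return 0;
	c->send_inflight = 1;
	return 1;
}

/* Handle a send completion: drop the finished buffer and return the
 * next one to submit, or NULL once the backlog has drained. */
static struct pending *conn_send_done(struct conn *c)
{
	struct pending *done = c->head;

	c->head = done->next;
	if (!c->head) {
		c->tail = NULL;
		c->send_inflight = 0;
	}
	free(done);
	return c->head;
}
```

With provided buffers plus multishot, this bookkeeping collapses into "add buffer to the group", which is the simplicity argument being made here.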

>>> Another downside is that you need a provided queue per socket, 
>>> which sounds pretty expensive for 100s if not 1000s socket apps.
>> 
>> That's certainly true. But either you need backlog per socket
>> anyway in the app, or you only send single buffers anyway (in a
>> typical request/response kind of fashion) between receives and you
>> don't need it at all.
> 
> That's pinning pages and maping them, which surely is not bad but
> with everything else equal malloc()/stack alloc is much nicer in
> terms of resources. (Not talking about CPU setup overhead).

Sure, it's not free in terms of memory either. As mentioned several
times, the main win is on efficiency and in reducing complexity, and
both of those are pretty nice imho.

>>>>> If you actually need to poll tx, you send a request and
>>>>> collect data into iov in userspace in background. When the
>>>>> request completes you send all that in batch. You can
>>>>> probably find a niche example when batch=1 in this case, but
>>>>> I don't think anyone would care.
>>>>> 
>>>>> The example doesn't use multi-iov, and also still has to 
>>>>> serialise requests, which naturally serialises buffer
>>>>> consumption w/o provided bufs.
>>>> 
>>>> IMHO there's no reason NOT to have both a send with provided
>>>> buffers and a multishot send. The alternative would be to have
>>>> send-N, where you pass in N. But I don't see much point to that
>>>> over "just drain the whole pending list". The obvious use case
>>>> is definitely send multishot, but
>>> 
>>> Not sure I follow, but in all cases I was contemplating, you send
>>> everything you have at the moment.
>>> 
>>>> what would the reasoning be to prohibit pacing by explicitly
>>>> disallowing only doing a single buffer (or a partial queue)? As
>>>> mentioned earlier, I like keeping the symmetry with the receive
>>>> side for multishot, and not make it any different unless
>>>> there's a reason to.
>>> 
>>> There are different, buffer content kernel (rx) vs userspace
>>> (tx) provided, provided queue / group per socket vs shared. Wake
>>> ups for multishots as per below. It's not like it's a one line
>>> change, so IMHO requires to be giving some benefits.
>> 
>> Are you talking about provided buffers, or multishot specifically?
>> I
> 
> I assumed that any of them would retry until the queue is exhausted, 
> at least that sounds more efficient and used in all comments.

That is what it does, it'll keep sending until it runs out of buffers
(or hits an error, short send, whatever).
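In CQE terms this is the standard multishot contract: keep consuming completions while IORING_CQE_F_MORE is set, and re-arm once it is clear (or on error). A tiny sketch of that decision; the flag value mirrors the uapi bit but is defined locally here for illustration, this is not liburing code:

```c
#include <assert.h>

/* Local stand-in for IORING_CQE_F_MORE (1U << 1 in the uapi header). */
#define CQE_F_MORE	(1U << 1)

/* Given a send CQE, decide whether the multishot request terminated
 * and a fresh one must be armed by the application. */
static int mshot_send_needs_rearm(int res, unsigned int flags)
{
	if (res < 0)
		return 1;		/* error ended the request */
	return !(flags & CQE_F_MORE);	/* no MORE: ran out of buffers */
}
```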

>> think both are standalone pretty much as simple as they can be. And
>> if the argument is "just have send with provided buffers be
>> multishot by default",
> 
> It's not, rx and tx are different, e.g. true tx multishot doesn't 
> seem to be possible because of that.

In the sense that rx and poll trigger on data now being available isn't
feasible on send, yeah they are not exact mirrors of each other. But
they are as close as they can be. If there was, or ever will be, an
efficient way to re-trigger a multishot send, that would certainly be a
doable and easy addition to make on top of this. It really only
changes the termination point, if you run out of buffers you just go to
whatever arming method would be suitable for that. But since the reason
for recv multishot is to avoid hammering on the locking on the poll
side, I'm not convinced that having a perpetual multishot send would
make a lot more sense than simply doing another one when needed. If
you're socket buffer bound on multishot send, then the perpetual poll
trigger works and is useful.

>> then that single patch is basically the two patches combined. 
>> There's no simplification there. Outside of a strong argument for
>> why it would never make sense to do single shot send with provided
>> buffers, I really don't want to combine them into one single
>> action.
> 
> In the current form it does make more sense to have multishot
> optionally.

I obviously agree on that too, kept them separate in the v4 posting as
well.
Pavel Begunkov Feb. 26, 2024, 8:51 p.m. UTC | #13
On 2/26/24 20:12, Jens Axboe wrote:
> On 2/26/24 12:21 PM, Pavel Begunkov wrote:
>> On 2/26/24 19:11, Jens Axboe wrote:
>>> On 2/26/24 8:41 AM, Pavel Begunkov wrote:
>>>> On 2/26/24 15:16, Jens Axboe wrote:
>>>>> On 2/26/24 7:36 AM, Pavel Begunkov wrote:
>>>>>> On 2/26/24 14:27, Jens Axboe wrote:
>>>>>>> On 2/26/24 7:02 AM, Dylan Yudaken wrote:
>>>>>>>> On Mon, Feb 26, 2024 at 1:38?PM Jens Axboe
...
>>> I don't think that's true - if you're doing large streaming, you're
>>> more likely to keep the socket buffer full, whereas for smallish
>>> sends, it's less likely to be full. Testing with the silly proxy
>>> confirms that. And
>>
>> I don't see any contradiction to what I said. With streaming/large
>> sends it's more likely to be polled. For small sends and
>> send-receive-send-... patterns the sock queue is unlikely to be full,
>> in which case the send is processed inline, and so the feature
>> doesn't add performance, as you agreed a couple email before.
> 
> Gotcha, I guess I misread you, we agree that the poll side is more
> likely on bigger buffers.
> 
>>> outside of cases where pacing just isn't feasible, it's extra
>>> overhead for cases where you potentially could or what.
>>
>> I lost it, what overhead?
> 
> Overhead of needing to serialize the sends in the application, which may
> include both extra memory needed and overhead in dealing with it.

I think I misread the code. Does it push 1 request for each
send buffer / queue_send() in case of provided bufs?

Anyway, the overhead of serialisation would be negligible.
And that's the same extra memory you keep for the provided buffer
pool, and you can allocate it once. Also consider that provided
buffers are fixed size and it'd be hard to resize without waiting,
thus the userspace would still need to have another, userspace
backlog, it can't just drop requests. Or you make provided queues
extra large, but it's per socket and you'd be wasting lots of memory.

IOW, I don't think this overhead comes anywhere close to
explaining the 30%+ perf gap.
  
>>> To me, the main appeal of this is the simplicity.
>>
>> I'd argue it doesn't seem any simpler than the alternative.
> 
> It's certainly simpler for an application to do "add buffer to queue"
> and not need to worry about managing sends, than it is to manage a
> backlog of only having a single send active.

They still need to manage / re-queue sends. And maybe I
misunderstand the point, but it's only one request inflight
per socket in either case.
  

>>>>> serialize sends. Using provided buffers makes this very easy,
>>>>> as you don't need to care about it at all, and it eliminates
>>>>> complexity in the application dealing with this.
>>>>
>>>> If I'm correct the example also serialises sends(?). I don't
>>>> think it's that simpler. You batch, you send. Same with this, but
>>>> batch into a provided buffer and the send is conditional.
>>>
>>> Do you mean the proxy example? Just want to be sure we're talking
>>> about
>>
>> Yes, proxy, the one you referenced in the CV. And FWIW, I don't think
>> it's a fair comparison without batching followed by multi-iov.
> 
> It's not about vectored vs non-vectored IO, though you could of course
> need to allocate an arbitrarily sized iovec that you can append to. And
> now you need to use sendmsg rather than just send, which has further
> overhead on top of send.

That's not nearly enough of overhead to explain the difference,
I don't believe so, going through the net stack is quite expensive.

> What kind of batching? The batching done by the tests are the same,
> regardless of whether or not send multishot is used in the sense that we

You can say that, but I say that it moves into the kernel
batching that can be implemented in userspace.

> wait on the same number of completions. As it's a basic proxy kind of
> thing, it'll receive a packet and send a packet. Submission batching is
> the same too, we'll submit when we have to.

"If you actually need to poll tx, you send a request and collect
data into iov in userspace in background. When the request
completes you send all that in batch..."

That's how it's in Thrift for example.

In terms of "proxy", the first approximation would be to
do sth like defer_send() for normal requests as well, then

static void __queue_send(struct io_uring *ring, struct conn *c, int fd,
			 void *data, int bid, int len)
{
	...

	defer_send(data);

	while (buf = defer_backlog.get()) {
		iov[idx++] = buf;
	}
	msghdr->iovlen = idx;
	...
}
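Fleshed out a little, that batching sketch might look like the following (plain C, illustrative names; writev() stands in for sendmsg() on the socket): queued buffers are appended to a growable iovec array and flushed with one vectored call.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

struct iov_batch {
	struct iovec *iov;
	int idx, cap;
};

/* Append one queued buffer to the batch, growing the array on demand. */
static void batch_add(struct iov_batch *b, void *data, size_t len)
{
	if (b->idx == b->cap) {
		b->cap = b->cap ? b->cap * 2 : 8;
		b->iov = realloc(b->iov, b->cap * sizeof(*b->iov));
	}
	b->iov[b->idx].iov_base = data;
	b->iov[b->idx].iov_len = len;
	b->idx++;
}

/* Flush the whole backlog with a single vectored write and reset. */
static ssize_t batch_flush(struct iov_batch *b, int fd)
{
	ssize_t ret = writev(fd, b->iov, b->idx);

	b->idx = 0;
	return ret;
}
```

A real sender would still have to handle partial writes from the flush, which is the same short-send corner case discussed elsewhere in this thread.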

>>> the same thing. Yes it has to serialize sends, because otherwise we
>>> can run into the condition described in the patch that adds
>>> provided buffer support for send. But I did bench multishot
>>> separately from there, here's some of it:
>>>
>>> 10G network, 3 hosts, 1 acting as a mirror proxy shuffling N-byte
>>> packets. Send ring and send multishot not used:
>>>
>>> Pkt sz | Send ring | mshot |  usec  |  QPS  |  Bw
>>> =====================================================
>>> 1000   |    No       |  No   |   437  | 1.22M | 9598M
>>> 32     |    No       |  No   |  5856  | 2.87M |  734M
>>>
>>> Same test, now turn on send ring:
>>>
>>> Pkt sz | Send ring | mshot |  usec  |  QPS  |  Bw   | Diff
>>> ===========================================================
>>> 1000   |    Yes       |  No   |   436  | 1.23M | 9620M | + 0.2%
>>> 32     |    Yes       |  No   |  3462  | 4.85M | 1237M | +68.5%
>>>
>>> Same test, now turn on send mshot as well:
>>>
>>> Pkt sz | Send ring | mshot |  usec  |  QPS  |  Bw   | Diff
>>> ===========================================================
>>> 1000   |    Yes       |  Yes  |   436  | 1.23M | 9620M | + 0.2%
>>> 32     |    Yes       |  Yes  |  3125  | 5.37M | 1374M | +87.2%
>>>
>>> which does show that there's another win on top for just queueing
>>> these sends and doing a single send to handle them, rather than
>>> needing to prepare a send for each buffer. Part of that may be that
>>> you simply run out of SQEs and then have to submit regardless of
>>> where you are at.
>>
>> How many sockets did you test with? It's 1 SQE per sock max
> 
> The above is just one, but I've run it with a lot more sockets. Nothing
> like thousands, but 64-128.
> 
>> +87% sounds like a huge difference, and I don't understand where it
>> comes from, hence the question
> 
> There are several things:
> 
> 1) Fact is that the app has to serialize sends for the unlikely case
>     of sends being reordered because of the condition outlined in the
>     patch that enables provided buffer support for send. This is the
>     largest win, particularly with smaller packets, as it ruins the
>     send pipeline.

Do those small packets force it to poll?

> 2) We're posting fewer SQEs. That's the multishot win. Obviously not
>     as large, but it does help.
> 
> People have asked in the past on how to serialize sends, and I've had to
> tell them that it isn't really possible. The only option we had was
> using drain or links, which aren't ideal nor very flexible. Using
> provided buffers finally gives the application a way to do that without
> needing to do anything really. Does every application need it? Certainly
> not, but for the ones that do, I do think it provides a great
> alternative that's better performing than doing single sends one at a
> time.

As per note on additional userspace backlog, any real generic app
and especially libs would need to do more to support it.
Jens Axboe Feb. 26, 2024, 9:27 p.m. UTC | #14
On 2/26/24 1:51 PM, Pavel Begunkov wrote:
> On 2/26/24 20:12, Jens Axboe wrote:
>> On 2/26/24 12:21 PM, Pavel Begunkov wrote:
>>> On 2/26/24 19:11, Jens Axboe wrote:
>>>> On 2/26/24 8:41 AM, Pavel Begunkov wrote:
>>>>> On 2/26/24 15:16, Jens Axboe wrote:
>>>>>> On 2/26/24 7:36 AM, Pavel Begunkov wrote:
>>>>>>> On 2/26/24 14:27, Jens Axboe wrote:
>>>>>>>> On 2/26/24 7:02 AM, Dylan Yudaken wrote:
>>>>>>>>> On Mon, Feb 26, 2024 at 1:38?PM Jens Axboe
> ...
>>>> I don't think that's true - if you're doing large streaming, you're
>>>> more likely to keep the socket buffer full, whereas for smallish
>>>> sends, it's less likely to be full. Testing with the silly proxy
>>>> confirms that. And
>>>
>>> I don't see any contradiction to what I said. With streaming/large
>>> sends it's more likely to be polled. For small sends and
>>> send-receive-send-... patterns the sock queue is unlikely to be full,
>>> in which case the send is processed inline, and so the feature
>>> doesn't add performance, as you agreed a couple email before.
>>
>> Gotcha, I guess I misread you, we agree that the poll side is more
>> likely on bigger buffers.
>>
>>>> outside of cases where pacing just isn't feasible, it's extra
>>>> overhead for cases where you potentially could or what.
>>>
>>> I lost it, what overhead?
>>
>> Overhead of needing to serialize the sends in the application, which may
>> include both extra memory needed and overhead in dealing with it.
> 
> I think I misread the code. Does it push 1 request for each
> send buffer / queue_send() in case of provided bufs?

Right, that's the way it's currently setup. Per send (per loop), if
you're using provided buffers, it'll do a send per buffer. If using
multishot on top of that, it'll do one send per loop regardless of the
number of buffers.

> Anyway, the overhead of serialisation would be negligible.
> And that's same extra memory you keep for the provided buffer
> pool, and you can allocate it once. Also consider that provided
> buffers are fixed size and it'd be hard to resize without waiting,
> thus the userspace would still need to have another, userspace
> backlog, it can't just drop requests. Or you make provided queues
> extra large, but it's per socket and you'd be wasting lots of memory.
> 
> IOW, I don't think this overhead comes anywhere close to
> explaining the 30%+ perf gap.

The 32-byte case is obviously somewhat pathological, as you're going to
be much better off having a bunch of these pipelined rather than issued
serially. As you can see from the 1000 byte packets, at that point it
doesn't matter that much. It's mostly about making it simpler at that
point.

>>>> To me, the main appeal of this is the simplicity.
>>>
>>> I'd argue it doesn't seem any simpler than the alternative.
>>
>> It's certainly simpler for an application to do "add buffer to queue"
>> and not need to worry about managing sends, than it is to manage a
>> backlog of only having a single send active.
> 
> They still need to manage / re-queue sends. And maybe I
> misunderstand the point, but it's only one request inflight
> per socket in either case.

Sure, but one is a manageable condition, the other one is not. If you
can keep N inflight at the same time and only abort the chain in case of
error/short send, that's a corner case. Versus not knowing when things
get reordered, and hence always needing to serialize.

>>>>>> serialize sends. Using provided buffers makes this very easy,
>>>>>> as you don't need to care about it at all, and it eliminates
>>>>>> complexity in the application dealing with this.
>>>>>
>>>>> If I'm correct the example also serialises sends(?). I don't
>>>>> think it's that simpler. You batch, you send. Same with this, but
>>>>> batch into a provided buffer and the send is conditional.
>>>>
>>>> Do you mean the proxy example? Just want to be sure we're talking
>>>> about
>>>
>>> Yes, proxy, the one you referenced in the CV. And FWIW, I don't think
>>> it's a fair comparison without batching followed by multi-iov.
>>
>> It's not about vectored vs non-vectored IO, though you could of course
>> need to allocate an arbitrarily sized iovec that you can append to. And
>> now you need to use sendmsg rather than just send, which has further
>> overhead on top of send.
> 
> That's not nearly enough of overhead to explain the difference,
> I don't believe so, going through the net stack is quite expensive.

See above, for the 32-byte packets, it's not hard to imagine big wins by
having many shoved in vs doing them piecemeal.

And honestly, I was surprised at how well the stack deals with this on
the networking side! It may have room for improvement, but it's not
nearly as sluggish as I feared.

>> What kind of batching? The batching done by the tests are the same,
>> regardless of whether or not send multishot is used in the sense that we
> 
> You can say that, but I say that it moves into the kernel
> batching that can be implemented in userspace.

And then most people get it wrong or just do the basic stuff, and
performance isn't very good. Getting the most out of it can be tricky
and require extensive testing and knowledge building. I'm confident
you'd be able to write an efficient version, but that's not the same as
saying "it's trivial to write an efficient version".

>> wait on the same number of completions. As it's a basic proxy kind of
>> thing, it'll receive a packet and send a packet. Submission batching is
>> the same too, we'll submit when we have to.
> 
> "If you actually need to poll tx, you send a request and collect
> data into iov in userspace in background. When the request
> completes you send all that in batch..."
> 
> That's how it's in Thrift for example.
> 
> In terms of "proxy", the first approximation would be to
> do sth like defer_send() for normal requests as well, then
> 
> static void __queue_send(struct io_uring *ring, struct conn *c, int fd,
>              void *data, int bid, int len)
> {
>     ...
> 
>     defer_send(data);
> 
>     while (buf = defer_backlog.get()) {
>         iov[idx++] = buf;
>     }
>     msghdr->iovlen = idx;
>     ...
> }

Yep, that's the iovec coalescing, and that could certainly be done. And
then you need to size the iov[] so that it's always big enough, OR
submit that send and still deal with managing your own backlog.

I don't think we disagree that there are other solutions. I'm saying
that I like this solution. I think it's simple to use for the cases that
can use it, and that's why the patches exist. It fits with the notion of
an async API being able to keep multiple things in flight, rather than a
semi solution where you kind of can, except not for cases X and Y
because of corner cases.

>>>> the same thing. Yes it has to serialize sends, because otherwise we
>>>> can run into the condition described in the patch that adds
>>>> provided buffer support for send. But I did bench multishot
>>>> separately from there, here's some of it:
>>>>
>>>> 10G network, 3 hosts, 1 acting as a mirror proxy shuffling N-byte
>>>> packets. Send ring and send multishot not used:
>>>>
>>>> Pkt sz | Send ring | mshot |  usec  |  QPS  |  Bw
>>>> =====================================================
>>>> 1000   |    No       |  No   |   437  | 1.22M | 9598M
>>>> 32     |    No       |  No   |  5856  | 2.87M |  734M
>>>>
>>>> Same test, now turn on send ring:
>>>>
>>>> Pkt sz | Send ring | mshot |  usec  |  QPS  |  Bw   | Diff
>>>> ===========================================================
>>>> 1000   |    Yes       |  No   |   436  | 1.23M | 9620M | + 0.2%
>>>> 32     |    Yes       |  No   |  3462  | 4.85M | 1237M | +68.5%
>>>>
>>>> Same test, now turn on send mshot as well:
>>>>
>>>> Pkt sz | Send ring | mshot |  usec  |  QPS  |  Bw   | Diff
>>>> ===========================================================
>>>> 1000   |    Yes       |  Yes  |   436  | 1.23M | 9620M | + 0.2%
>>>> 32     |    Yes       |  Yes  |  3125  | 5.37M | 1374M | +87.2%
>>>>
>>>> which does show that there's another win on top for just queueing
>>>> these sends and doing a single send to handle them, rather than
>>>> needing to prepare a send for each buffer. Part of that may be that
>>>> you simply run out of SQEs and then have to submit regardless of
>>>> where you are at.
>>>
>>> How many sockets did you test with? It's 1 SQE per sock max
>>
>> The above is just one, but I've run it with a lot more sockets. Nothing
>> like thousands, but 64-128.
>>
>>> +87% sounds like a huge difference, and I don't understand where it
>>> comes from, hence the question
>>
>> There are several things:
>>
>> 1) Fact is that the app has to serialize sends for the unlikely case
>>     of sends being reordered because of the condition outlined in the
>>     patch that enables provided buffer support for send. This is the
>>     largest win, particularly with smaller packets, as it ruins the
>>     send pipeline.
> 
> Do those small packets force it to poll?

There's no polling in my testing.

>> 2) We're posting fewer SQEs. That's the multishot win. Obviously not
>>     as large, but it does help.
>>
>> People have asked in the past on how to serialize sends, and I've had to
>> tell them that it isn't really possible. The only option we had was
>> using drain or links, which aren't ideal nor very flexible. Using
>> provided buffers finally gives the application a way to do that without
>> needing to do anything really. Does every application need it? Certainly
>> not, but for the ones that do, I do think it provides a great
>> alternative that's better performing than doing single sends one at a
>> time.
> 
> As per note on additional userspace backlog, any real generic app
> and especially libs would need to do more to support it.

Sure, if you get a short send or any abort in the chain, you need to
handle it. But things stall/stop at that point and you handle it, and
then you're back up and running. This is vastly different from "oh I
always need to serialize because X or Y may happen, even though it
rarely does or never does in my case".
Pavel Begunkov Feb. 28, 2024, 12:39 p.m. UTC | #15
On 2/26/24 21:27, Jens Axboe wrote:
> On 2/26/24 1:51 PM, Pavel Begunkov wrote:
>> On 2/26/24 20:12, Jens Axboe wrote:
>>> On 2/26/24 12:21 PM, Pavel Begunkov wrote:
>>>> On 2/26/24 19:11, Jens Axboe wrote:
>>>>> On 2/26/24 8:41 AM, Pavel Begunkov wrote:
>>>>>> On 2/26/24 15:16, Jens Axboe wrote:
>>>>>>> On 2/26/24 7:36 AM, Pavel Begunkov wrote:
>>>>>>>> On 2/26/24 14:27, Jens Axboe wrote:
>>>>>>>>> On 2/26/24 7:02 AM, Dylan Yudaken wrote:
>>>>>>>>>> On Mon, Feb 26, 2024 at 1:38?PM Jens Axboe
>> ...
>>>>> I don't think that's true - if you're doing large streaming, you're
>>>>> more likely to keep the socket buffer full, whereas for smallish
>>>>> sends, it's less likely to be full. Testing with the silly proxy
>>>>> confirms that. And
>>>>
>>>> I don't see any contradiction to what I said. With streaming/large
>>>> sends it's more likely to be polled. For small sends and
>>>> send-receive-send-... patterns the sock queue is unlikely to be full,
>>>> in which case the send is processed inline, and so the feature
>>>> doesn't add performance, as you agreed a couple email before.
>>>
>>> Gotcha, I guess I misread you, we agree that the poll side is more
>>> likely on bigger buffers.
>>>
>>>>> outside of cases where pacing just isn't feasible, it's extra
>>>>> overhead for cases where you potentially could or what.
>>>>
>>>> I lost it, what overhead?
>>>
>>> Overhead of needing to serialize the sends in the application, which may
>>> include both extra memory needed and overhead in dealing with it.
>>
>> I think I misread the code. Does it push 1 request for each
>> send buffer / queue_send() in case of provided bufs?
> 
> Right, that's the way it's currently setup. Per send (per loop), if
> you're using provided buffers, it'll do a send per buffer. If using
> multishot on top of that, it'll do one send per loop regardless of the
> number of buffers.

Ok, then I'd say the performance tests are misleading, at least
the proxy one. For selected buffers, sending tons of requests sounds
unnecessarily expensive, but I don't have numbers to back it. The
normal case must employ the natural batching it has with
serialisation.

It's important to compare programs that are optimised and close
to what is achievable in userspace. You wouldn't compare it with

for (i = 0; i < nr_bytes; i++)
	send(data+i, bytes=1);

and claim that it's N times faster. Or surely we can add an
empty "for" loop to one side and say "users are stupid and
will put garbage here anyway, so fair game", but then it
needs a note in small font "valid only for users who can't
write code up to standard"

static void defer_send() {
	ps = malloc(sizeof(*ps));
}

fwiw, maybe malloc is not really expensive there, but
that sounds like a pretty bad practice for a hi perf
benchmark.
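A common way to avoid that per-request malloc(), assuming the pending-send struct is small and fixed size, is a simple free list that recycles nodes so the steady state allocates nothing. A hypothetical sketch, names invented for illustration:

```c
#include <assert.h>
#include <stdlib.h>

struct ps_node {
	void *data;
	struct ps_node *next;
};

/* Recycled nodes; only the cold path ever calls malloc(). */
static struct ps_node *ps_freelist;

static struct ps_node *ps_alloc(void)
{
	struct ps_node *n = ps_freelist;

	if (n) {
		ps_freelist = n->next;
		return n;
	}
	return malloc(sizeof(*n));
}

static void ps_free(struct ps_node *n)
{
	n->next = ps_freelist;
	ps_freelist = n;
}
```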

>> Anyway, the overhead of serialisation would be negligible.
>> And that's same extra memory you keep for the provided buffer
>> pool, and you can allocate it once. Also consider that provided
>> buffers are fixed size and it'd be hard to resize without waiting,
>> thus the userspace would still need to have another, userspace
>> backlog, it can't just drop requests. Or you make provided queues
>> extra large, but it's per socket and you'd be wasting lots of memory.
>>
>> IOW, I don't think this overhead comes anywhere close to
>> explaining the 30%+ perf gap.
> 
> The 32-byte case is obviously somewhat pathological, as you're going to
> be much better off having a bunch of these pipelined rather than issued
> serially. As you can see from the 1000 byte packets, at that point it
> doesn't matter that much. It's mostly about making it simpler at that
> point.
> 
>>>>> To me, the main appeal of this is the simplicity.
>>>>
>>>> I'd argue it doesn't seem any simpler than the alternative.
>>>
>>> It's certainly simpler for an application to do "add buffer to queue"
>>> and not need to worry about managing sends, than it is to manage a
>>> backlog of only having a single send active.
>>
>> They still need to manage / re-queue sends. And maybe I
>> misunderstand the point, but it's only one request inflight
>> per socket in either case.
> 
> Sure, but one is a manageable condition, the other one is not. If you
> can keep N inflight at the same time and only abort the chain in case of
> error/short send, that's a corner case. Versus not knowing when things
> get reordered, and hence always needing to serialize.

Manageable or not, you still have to implement all that, whereas
serialisation is not complex and I doubt it's anywhere near expensive
enough to overturn the picture. It seems that multishot selected
bufs would also need serialisation, and for oneshots managing
multiple requests when you don't know which one sends what buffer
would be complicated in real scenarios.

>>> What kind of batching? The batching done by the tests are the same,
>>> regardless of whether or not send multishot is used in the sense that we
>>
>> You can say that, but I say that it moves into the kernel
>> batching that can be implemented in userspace.
> 
> And then most people get it wrong or just do the basic stuff, and
> performance isn't very good. Getting the most out of it can be tricky
> and require extensive testing and knowledge building. I'm confident
> you'd be able to write an efficient version, but that's not the same as
> saying "it's trivial to write an efficient version".

It actually _is_ trivial to write an efficient version for anyone
competent enough and having a spare day for that, which is usually
in the form of a library

I'm a firm believer of not putting into the kernel what can
already be well implemented in userspace, because the next step
could be "firefox can be implemented in userspace, but it requires
knowledge building, so let's implement it in the kernel". At
least I recall nginx / HTTP servers not flying well

>>> wait on the same number of completions. As it's a basic proxy kind of
>>> thing, it'll receive a packet and send a packet. Submission batching is
>>> the same too, we'll submit when we have to.
>>
>> "If you actually need to poll tx, you send a request and collect
>> data into iov in userspace in background. When the request
>> completes you send all that in batch..."
>>
>> That's how it's in Thrift for example.
>>
>> In terms of "proxy", the first approximation would be to
>> do sth like defer_send() for normal requests as well, then
>>
>> static void __queue_send(struct io_uring *ring, struct conn *c, int fd,
>>               void *data, int bid, int len)
>> {
>>      ...
>>
>>      defer_send(data);
>>
>>      while (buf = defer_backlog.get()) {
>>          iov[idx++] = buf;
>>      }
>>      msghdr->iovlen = idx;
>>      ...
>> }
> 
> Yep, that's the iovec coalescing, and that could certainly be done. And
> then you need to size the iov[] so that it's always big enough, OR
> submit that send and still deal with managing your own backlog.

Which is as trivial as iov.push_back() or a couple of lines with
realloc, whereas for selected buffers you would likely need to
wait for the previous request[s] to complete, which you cannot
do in place.

The point is, it seems that when you'd try to write something
error proof and real life ready, it wouldn't look simpler or
much simpler than the approach suggested, then the performance
is the question.

> I don't think we disagree that there are other solutions. I'm saying
> that I like this solution. I think it's simple to use for the cases that
> can use it, and that's why the patches exist. It fits with the notion of
> an async API being able to keep multiple things in flight, rather than a
> semi solution where you kind of can, except not for cases X and Y
> because of corner cases.
...
Jens Axboe Feb. 28, 2024, 5:28 p.m. UTC | #16
On 2/28/24 5:39 AM, Pavel Begunkov wrote:
> On 2/26/24 21:27, Jens Axboe wrote:
>> On 2/26/24 1:51 PM, Pavel Begunkov wrote:
>>> On 2/26/24 20:12, Jens Axboe wrote:
>>>> On 2/26/24 12:21 PM, Pavel Begunkov wrote:
>>>>> On 2/26/24 19:11, Jens Axboe wrote:
>>>>>> On 2/26/24 8:41 AM, Pavel Begunkov wrote:
>>>>>>> On 2/26/24 15:16, Jens Axboe wrote:
>>>>>>>> On 2/26/24 7:36 AM, Pavel Begunkov wrote:
>>>>>>>>> On 2/26/24 14:27, Jens Axboe wrote:
>>>>>>>>>> On 2/26/24 7:02 AM, Dylan Yudaken wrote:
>>>>>>>>>>> On Mon, Feb 26, 2024 at 1:38?PM Jens Axboe
>>> ...
>>>>>> I don't think that's true - if you're doing large streaming, you're
>>>>>> more likely to keep the socket buffer full, whereas for smallish
>>>>>> sends, it's less likely to be full. Testing with the silly proxy
>>>>>> confirms that. And
>>>>>
>>>>> I don't see any contradiction to what I said. With streaming/large
>>>>> sends it's more likely to be polled. For small sends and
>>>>> send-receive-send-... patterns the sock queue is unlikely to be full,
>>>>> in which case the send is processed inline, and so the feature
>>>>> doesn't add performance, as you agreed a couple email before.
>>>>
>>>> Gotcha, I guess I misread you, we agree that the poll side is more
>>>> likely on bigger buffers.
>>>>
>>>>>> outside of cases where pacing just isn't feasible, it's extra
>>>>>> overhead for cases where you potentially could or what.
>>>>>
>>>>> I lost it, what overhead?
>>>>
>>>> Overhead of needing to serialize the sends in the application, which may
>>>> include both extra memory needed and overhead in dealing with it.
>>>
>>> I think I misread the code. Does it push 1 request for each
>>> send buffer / queue_send() in case of provided bufs?
>>
>> Right, that's the way it's currently setup. Per send (per loop), if
>> you're using provided buffers, it'll do a send per buffer. If using
>> multishot on top of that, it'll do one send per loop regardless of the
>> number of buffers.
> 
> Ok, then I'd say the performance tests are misleading, at least
> the proxy one. For selected buffers, sending tons of requests sounds
> unnecessarily expensive, but I don't have numbers to back it. The
> normal case must employ the natural batching it has with
> serialisation.
> 
> It's important comparing programs that are optimised and close
> to what is achievable in userspace. You wouldn't compare it with
> 
> for (i = 0; i < nr_bytes; i++)
>     send(data+i, bytes=1);
> 
> and claim that it's N times faster. Or surely we can add an
> empty "for" loop to one side and say "users are stupid and
> will put garbage here anyway, so fair game", but then it
> needs a note in small font "valid only for users who can't
> write code up to standard"
> 
> static void defer_send() {
>     ps = malloc(sizeof(*ps));
> }
> 
> fwiw, maybe malloc is not really expensive there, but
> that sounds like a pretty bad practice for a hi perf
> benchmark.

It's comparing send with a manually managed backlog against send where
you don't have to do that. I think that's pretty valid by
itself. If you don't do that, you need to use sendmsg and append iovecs.
Which is obviously also a very valid use case and would avoid the
backlog, at the cost of sendmsg being more expensive than send. If done
right, the sendmsg+append would be a lot closer to send with multishot.

When I have some time I can add the append case, or feel free to do
that yourself, and I can run some testing with that too.
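For reference, the sendmsg-plus-append alternative under discussion could look roughly like this in userspace (a minimal sketch with hypothetical names and structures, not the proxy's actual code): buffers are parked in a FIFO backlog and drained into one iovec array for a single sendmsg.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>

/* Hypothetical FIFO backlog of pending send buffers. */
#define BACKLOG_MAX 64

struct send_backlog {
	struct iovec bufs[BACKLOG_MAX];
	int head, tail;
};

static int backlog_push(struct send_backlog *b, void *data, size_t len)
{
	if (b->tail - b->head >= BACKLOG_MAX)
		return -1;	/* full: caller must wait or grow */
	b->bufs[b->tail % BACKLOG_MAX].iov_base = data;
	b->bufs[b->tail % BACKLOG_MAX].iov_len = len;
	b->tail++;
	return 0;
}

/*
 * Drain the backlog into one iovec array for a single sendmsg. Returns
 * the number of vecs filled; the caller would set msg_iov/msg_iovlen
 * and issue one sendmsg instead of one send per buffer.
 */
static int backlog_coalesce(struct send_backlog *b, struct iovec *iov, int max)
{
	int n = 0;

	while (b->head != b->tail && n < max)
		iov[n++] = b->bufs[b->head++ % BACKLOG_MAX];
	return n;
}
```

The open questions from the thread remain: sizing `iov[]` (here just capped at `max`) and still only keeping one sendmsg inflight per socket.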

>>> Anyway, the overhead of serialisation would be negligible.
>>> And that's same extra memory you keep for the provided buffer
>>> pool, and you can allocate it once. Also consider that provided
>>> buffers are fixed size and it'd be hard to resize without waiting,
>>> thus the userspace would still need to have another, userspace
>>> backlog, it can't just drop requests. Or you make provided queues
>>> extra large, but it's per socket and you'd be wasting lots of memory.
>>>
>>> IOW, I don't think this overhead could anyhow close us to
>>> the understanding of the 30%+ perf gap.
>>
>> The 32-byte case is obviously somewhat pathological, as you're going to
>> be much better off having a bunch of these pipelined rather than issued
>> serially. As you can see from the 1000 byte packets, at that point it
>> doesn't matter that much. It's mostly about making it simpler at that
>> point.
>>
>>>>>> To me, the main appeal of this is the simplicity.
>>>>>
>>>>> I'd argue it doesn't seem any simpler than the alternative.
>>>>
>>>> It's certainly simpler for an application to do "add buffer to queue"
>>>> and not need to worry about managing sends, than it is to manage a
>>>> backlog of only having a single send active.
>>>
>>> They still need to manage / re-queue sends. And maybe I
>>> misunderstand the point, but it's only one request inflight
>>> per socket in either case.
>>
>> Sure, but one is a manageable condition, the other one is not. If you
>> can keep N inflight at the same time and only abort the chain in case of
>> error/short send, that's a corner case. Versus not knowing when things
>> get reordered, and hence always needing to serialize.
> 
> Manageable or not, you still have to implement all that, whereas
> serialisation is not complex and I doubt it's anywhere expensive
> enough to overturn the picture. It seems that multishot selected
> bufs would also need serialisation, and for oneshots managing
> multiple requests when you don't know which one sends what buffer
> would be complicated in real scenarios.

What "all that"? It's pretty trivial when you have a normal abort
condition to handle it. For the sendmsg case, you can append multiple
iovecs, but you can still only have one sendmsg inflight. If you start
prepping the next one before the previous one has successfully
completed, you have the same issue again.

The only serialization you need is to have one send inflight, which is
in your best interest anyway, as it would be more wasteful to submit
several that each empty that particular queue.
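That single-inflight rule can be sketched as a tiny per-connection state machine (hypothetical names; the actual SQE prep/submit is stubbed out as a counter):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-connection send state: at most one send inflight. */
struct conn_tx {
	bool inflight;
	int queued;	/* buffers parked behind the inflight send */
	int submitted;	/* sends actually handed to the kernel */
};

/* Stand-in for prepping and submitting a send SQE. */
static void start_send(struct conn_tx *c)
{
	c->inflight = true;
	c->submitted++;
}

/* Called for each new outgoing buffer. */
static void queue_send(struct conn_tx *c)
{
	if (c->inflight)
		c->queued++;	/* serialize: park behind the inflight send */
	else
		start_send(c);
}

/* Called on the send CQE: kick the next queued buffer, if any. */
static void send_done(struct conn_tx *c)
{
	c->inflight = false;
	if (c->queued) {
		c->queued--;
		start_send(c);
	}
}
```

The point of contention in the thread is whether this bookkeeping (plus the backlog memory behind `queued`) is trivial or a real burden compared to arming one multishot send.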

>>>> What kind of batching? The batching done by the tests are the same,
>>>> regardless of whether or not send multishot is used in the sense that we
>>>
>>> You can say that, but I say that it moves into the kernel
>>> batching that can be implemented in userspace.
>>
>> And then most people get it wrong or just do the basic stuff, and
>> performance isn't very good. Getting the most out of it can be tricky
>> and require extensive testing and knowledge building. I'm confident
>> you'd be able to write an efficient version, but that's not the same as
>> saying "it's trivial to write an efficient version".
> 
> It actually _is_ trivial to write an efficient version for anyone
> competent enough and having a spare day for that, which is usually
> in a form of a library

If using sendmsg.

> I'm a firm believer of not putting into the kernel what can
> already be well implemented in userspace, because the next step
> could be "firefox can be implemented in userspace, but it requires
> knowledge building, so let's implement it in the kernel". At
> least I recall nginx / HTTP servers not flying well

Sorry but that's nonsense, we add kernel features all the time that
could be done (just less efficiently) in userspace. And the firefox
comment is crazy hyperbole, not relevant here at all.

>>> In terms of "proxy", the first approximation would be to
>>> do sth like defer_send() for normal requests as well, then
>>>
>>> static void __queue_send(struct io_uring *ring, struct conn *c, int fd,
>>>               void *data, int bid, int len)
>>> {
>>>      ...
>>>
>>>      defer_send(data);
>>>
>>>      while (buf = defer_backlog.get()) {
>>>          iov[idx++] = buf;
>>>      }
>>>      msghdr->iovlen = idx;
>>>      ...
>>> }
>>
>> Yep, that's the iovec coalescing, and that could certainly be done. And
>> then you need to size the iov[] so that it's always big enough, OR
>> submit that send and still deal with managing your own backlog.
> 
> Which is as trivial as iov.push_back() or a couple of lines with
> realloc, whereas for selected buffers you would likely need to
> wait for the previous request[s] to complete, which you cannot
> do in place.

Go implement it and we can benchmark. It won't solve the sendmsg being
slower than send situation in general, but it should certainly get a lot
closer in terms of using sendmsg and serializing per send issue.

> The point is, it seems that when you'd try to write something
> error proof and real life ready, it wouldn't look simpler or
> much simpler than the approach suggested, then the performance
> is the question.

The condition for sendmsg with append and send with multishot handling
is basically the same: you need to deal with checking the send result and
restarting your send from there. Obviously in practice this will
never/rarely happen unless the connection is aborted anyway. If you use
send multishot with MSG_WAITALL, then if the buffer fills you'll just
restart when it has space, nothing to handle there.
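Both paths end up with the same short-send handling described above: check the result, advance an offset, and reissue if anything is left. A minimal sketch of just that bookkeeping (hypothetical helper, not code from the patch):

```c
#include <assert.h>
#include <stddef.h>

/* Outcome of inspecting a send CQE result against the requested length. */
struct send_restart {
	size_t off;	/* resume offset into the buffer */
	int resubmit;	/* 1 if a follow-up send is needed */
};

/*
 * res is the CQE result, off the offset already sent, len the total
 * buffer length. On a short send, restart from the new offset; on
 * error, leave state untouched for the caller to handle (-EAGAIN etc).
 */
static struct send_restart handle_send_result(long res, size_t off, size_t len)
{
	struct send_restart r = { .off = off, .resubmit = 0 };

	if (res < 0)
		return r;
	r.off += (size_t)res;
	if (r.off < len)
		r.resubmit = 1;	/* short send: reissue from r.off */
	return r;
}
```

In practice, as noted above, this path rarely triggers unless the connection is aborted.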
Jens Axboe Feb. 28, 2024, 11:49 p.m. UTC | #17
On 2/28/24 10:28 AM, Jens Axboe wrote:
> When I have some time I can add the append case, or feel free to do
> that yourself, and I can run some testing with that too.

I did a quick hack to add the append mode, and by default we get roughly
ring_size / 2 number of appended vecs, which I guess is expected.
There's a few complications there:

1) We basically need a per-send data structure at this point. While
   that's not hard to do, at least for my case I'd need to add that just
   for this case and everything would now need to do it. Perhaps. You
   can perhaps steal a few low bits and only do it for sendmsg. But why?
   Because now you have multiple buffer IDs in a send completion, and
   you need to be able to unravel it. If we always receive and send in
   order, then it'd always be contiguous, which I took advantage of.
   Not a huge deal, just mentioning some of the steps I had to go
   through.

2) The iovec. Fine if you have the above data structure, as you can just
   alloc/realloc -> free when done. I just took the easy path and made
   the iovec big enough (eg ring size).

Outside of that, didn't need to make a lot of changes:

 1 file changed, 39 insertions(+), 17 deletions(-)

Performance is great, because we get to pass in N (in my case ~65)
packets per send. No more per packet locking. Which I do think
highlights that we can do better on the multishot send/sendmsg by
grabbing buffers upfront and passing them in in one go rather than
simply loop around calling tcp_sendmsg_locked() for each one.

Anyway, something to look into!
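The contiguity point in 1) above — with in-order receive and send, one appended-send completion covers a contiguous range of buffer IDs — can be sketched as follows (hypothetical helper; assumes fixed-size provided buffers and strict ordering, per the discussion):

```c
#include <assert.h>
#include <stddef.h>

/*
 * Map one send completion back to the buffer IDs it consumed, so they
 * can be recycled. res is the CQE result (bytes sent), buf_len the
 * fixed provided-buffer size, first_bid the first buffer ID covered.
 * Returns the number of IDs written to bids[].
 */
static int bids_for_completion(long res, size_t buf_len, int first_bid,
			       int *bids, int max)
{
	int n = 0;

	while (res > 0 && n < max) {
		bids[n] = first_bid + n;	/* contiguous by assumption */
		n++;
		res -= (long)buf_len;
	}
	return n;
}
```

If receives and sends could reorder, this collapses: each send would need its own list of buffer IDs, which is exactly the per-send data structure the email describes.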
Jens Axboe Feb. 29, 2024, 1:46 a.m. UTC | #18
On 2/28/24 4:49 PM, Jens Axboe wrote:
> On 2/28/24 10:28 AM, Jens Axboe wrote:
>> When I have some time I can add the append case, or feel free to do
>> that yourself, and I can run some testing with that too.
> 
> I did a quick hack to add the append mode, and by default we get roughly
> ring_size / 2 number of appended vecs, which I guess is expected.
> There's a few complications there:
> 
> 1) We basically need a per-send data structure at this point. While
>    that's not hard to do, at least for my case I'd need to add that just
>    for this case and everything would now need to do it. Perhaps. You
>    can perhaps steal a few low bits and only do it for sendmsg. But why?
>    Because now you have multiple buffer IDs in a send completion, and
>    you need to be able to unravel it. If we always receive and send in
>    order, then it'd always be contiguous, which I took advantage of.
>    Not a huge deal, just mentioning some of the steps I had to go
>    through.
> 
> 2) The iovec. Fine if you have the above data structure, as you can just
>    alloc/realloc -> free when done. I just took the easy path and made
>    the iovec big enough (eg ring size).
> 
> Outside of that, didn't need to make a lot of changes:
> 
>  1 file changed, 39 insertions(+), 17 deletions(-)
> 
> Performance is great, because we get to pass in N (in my case ~65)
> packets per send. No more per packet locking. Which I do think
> highlights that we can do better on the multishot send/sendmsg by
> grabbing buffers upfront and passing them in in one go rather than
> simply loop around calling tcp_sendmsg_locked() for each one.

In terms of absolute numbers, the previous best time was multishot send,
which ran in 3125 usec. Using either the above approach, or a hacked up
version of multishot send that uses provided buffers and bundles them
into one send (ala sendmsg), the runtimes are within 1% of each other
and too close to call. But the runtime is around 2320 usec, or around
25% faster than doing one issue at a time.

This is using the same small packet size of 32 bytes. Just did the
bundled send multishot thing to test, haven't tested more than that so
far.
Jens Axboe Feb. 29, 2024, 3:42 p.m. UTC | #19
On 2/28/24 6:46 PM, Jens Axboe wrote:
> On 2/28/24 4:49 PM, Jens Axboe wrote:
>> On 2/28/24 10:28 AM, Jens Axboe wrote:
>>> When I have some time I can add the append case, or feel free to do
>>> that yourself, and I can run some testing with that too.
>>
>> I did a quick hack to add the append mode, and by default we get roughly
>> ring_size / 2 number of appended vecs, which I guess is expected.
>> There's a few complications there:
>>
>> 1) We basically need a per-send data structure at this point. While
>>    that's not hard to do, at least for my case I'd need to add that just
>>    for this case and everything would now need to do it. Perhaps. You
>>    can perhaps steal a few low bits and only do it for sendmsg. But why?
>>    Because now you have multiple buffer IDs in a send completion, and
>>    you need to be able to unravel it. If we always receive and send in
>>    order, then it'd always be contiguous, which I took advantage of.
>>    Not a huge deal, just mentioning some of the steps I had to go
>>    through.
>>
>> 2) The iovec. Fine if you have the above data structure, as you can just
>>    alloc/realloc -> free when done. I just took the easy path and made
>>    the iovec big enough (eg ring size).
>>
>> Outside of that, didn't need to make a lot of changes:
>>
>>  1 file changed, 39 insertions(+), 17 deletions(-)
>>
>> Performance is great, because we get to pass in N (in my case ~65)
>> packets per send. No more per packet locking. Which I do think
>> highlights that we can do better on the multishot send/sendmsg by
>> grabbing buffers upfront and passing them in in one go rather than
>> simply loop around calling tcp_sendmsg_locked() for each one.
> 
> In terms of absolute numbers, the previous best time was multishot send,
> which ran in 3125 usec. Using either the above approach, or a hacked up
> version of multishot send that uses provided buffers and bundles them
> into one send (ala sendmsg), the runtimes are within 1% of each other
> and too close to call. But the runtime is around 2320 usec, or around
> 25% faster than doing one issue at a time.
> 
> This is using the same small packet size of 32 bytes. Just did the
> bundled send multishot thing to test, haven't tested more than that so
> far.

Update: I had a bug in the sendmsg with multiple vecs: it did not keep a
single msg inflight at a time, there could be multiple. Which obviously
can't work. That voids the sendmsg append results for now.
Will fiddle with it a bit and get it working, and post the full results
when I have them.

Patch

diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 74c3afac9c63..6766e78ee03b 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -351,9 +351,17 @@  enum io_uring_op {
  *				0 is reported if zerocopy was actually possible.
  *				IORING_NOTIF_USAGE_ZC_COPIED if data was copied
  *				(at least partially).
+ *
+ * IORING_SEND_MULTISHOT	Multishot send. Like the recv equivalent, must
+ *				be used with provided buffers. Keeps sending
+ *				from the given buffer group ID until it is
+ *				empty. Sets IORING_CQE_F_MORE if more
+ *				completions should be expected on behalf of
+ *				the same SQE.
  */
 #define IORING_RECVSEND_POLL_FIRST	(1U << 0)
 #define IORING_RECV_MULTISHOT		(1U << 1)
+#define IORING_SEND_MULTISHOT		IORING_RECV_MULTISHOT
 #define IORING_RECVSEND_FIXED_BUF	(1U << 2)
 #define IORING_SEND_ZC_REPORT_USAGE	(1U << 3)
 
diff --git a/io_uring/net.c b/io_uring/net.c
index 30afb394efd7..8237ac5c957f 100644
--- a/io_uring/net.c
+++ b/io_uring/net.c
@@ -204,6 +204,16 @@  static int io_setup_async_msg(struct io_kiocb *req,
 	return -EAGAIN;
 }
 
+static inline void io_mshot_prep_retry(struct io_kiocb *req)
+{
+	struct io_sr_msg *sr = io_kiocb_to_cmd(req, struct io_sr_msg);
+
+	req->flags &= ~REQ_F_BL_EMPTY;
+	sr->done_io = 0;
+	sr->len = 0; /* get from the provided buffer */
+	req->buf_index = sr->buf_group;
+}
+
 static bool io_recvmsg_multishot_overflow(struct io_async_msghdr *iomsg)
 {
 	int hdr;
@@ -401,6 +411,8 @@  void io_sendmsg_recvmsg_cleanup(struct io_kiocb *req)
 	kfree(io->free_iov);
 }
 
+#define SENDMSG_FLAGS (IORING_RECVSEND_POLL_FIRST | IORING_SEND_MULTISHOT)
+
 int io_sendmsg_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 {
 	struct io_sr_msg *sr = io_kiocb_to_cmd(req, struct io_sr_msg);
@@ -417,11 +429,19 @@  int io_sendmsg_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 	sr->umsg = u64_to_user_ptr(READ_ONCE(sqe->addr));
 	sr->len = READ_ONCE(sqe->len);
 	sr->flags = READ_ONCE(sqe->ioprio);
-	if (sr->flags & ~IORING_RECVSEND_POLL_FIRST)
+	if (sr->flags & ~SENDMSG_FLAGS)
 		return -EINVAL;
 	sr->msg_flags = READ_ONCE(sqe->msg_flags) | MSG_NOSIGNAL;
 	if (sr->msg_flags & MSG_DONTWAIT)
 		req->flags |= REQ_F_NOWAIT;
+	if (sr->flags & IORING_SEND_MULTISHOT) {
+		if (!(req->flags & REQ_F_BUFFER_SELECT))
+			return -EINVAL;
+		if (sr->msg_flags & MSG_WAITALL)
+			return -EINVAL;
+		req->flags |= REQ_F_APOLL_MULTISHOT;
+		sr->buf_group = req->buf_index;
+	}
 
 #ifdef CONFIG_COMPAT
 	if (req->ctx->compat)
@@ -431,6 +451,44 @@  int io_sendmsg_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 	return 0;
 }
 
+static inline bool io_send_finish(struct io_kiocb *req, int *ret,
+				  struct msghdr *msg, unsigned issue_flags)
+{
+	bool mshot_finished = *ret <= 0;
+	unsigned int cflags;
+
+	cflags = io_put_kbuf(req, issue_flags);
+
+	if (!(req->flags & REQ_F_APOLL_MULTISHOT)) {
+		io_req_set_res(req, *ret, cflags);
+		*ret = IOU_OK;
+		return true;
+	}
+
+	if (mshot_finished || req->flags & REQ_F_BL_EMPTY)
+		goto finish;
+
+	/*
+	 * Fill CQE for this send and see if we should keep trying to
+	 * send on this socket.
+	 */
+	if (io_fill_cqe_req_aux(req, issue_flags & IO_URING_F_COMPLETE_DEFER,
+				*ret, cflags | IORING_CQE_F_MORE)) {
+		io_mshot_prep_retry(req);
+		*ret = IOU_ISSUE_SKIP_COMPLETE;
+		return false;
+	}
+
+	/* Otherwise stop multishot but use the current result. */
+finish:
+	io_req_set_res(req, *ret, cflags);
+	if (issue_flags & IO_URING_F_MULTISHOT)
+		*ret = IOU_STOP_MULTISHOT;
+	else
+		*ret = IOU_OK;
+	return true;
+}
+
 int io_sendmsg(struct io_kiocb *req, unsigned int issue_flags)
 {
 	struct io_sr_msg *sr = io_kiocb_to_cmd(req, struct io_sr_msg);
@@ -511,7 +569,6 @@  int io_send(struct io_kiocb *req, unsigned int issue_flags)
 	struct sockaddr_storage __address;
 	struct io_sr_msg *sr = io_kiocb_to_cmd(req, struct io_sr_msg);
 	size_t len = sr->len;
-	unsigned int cflags;
 	struct socket *sock;
 	struct msghdr msg;
 	unsigned flags;
@@ -542,10 +599,14 @@  int io_send(struct io_kiocb *req, unsigned int issue_flags)
 	    (sr->flags & IORING_RECVSEND_POLL_FIRST))
 		return io_setup_async_addr(req, &__address, issue_flags);
 
+	if (!io_check_multishot(req, issue_flags))
+		return -EAGAIN;
+
 	sock = sock_from_file(req->file);
 	if (unlikely(!sock))
 		return -ENOTSOCK;
 
+retry_multishot:
 	if (io_do_buffer_select(req)) {
 		void __user *buf;
 
@@ -570,8 +631,16 @@  int io_send(struct io_kiocb *req, unsigned int issue_flags)
 	msg.msg_flags = flags;
 	ret = sock_sendmsg(sock, &msg);
 	if (ret < min_ret) {
-		if (ret == -EAGAIN && (issue_flags & IO_URING_F_NONBLOCK))
-			return io_setup_async_addr(req, &__address, issue_flags);
+		if (ret == -EAGAIN && (issue_flags & IO_URING_F_NONBLOCK)) {
+			ret = io_setup_async_addr(req, &__address, issue_flags);
+			if (ret != -EAGAIN)
+				return ret;
+			if (issue_flags & IO_URING_F_MULTISHOT) {
+				io_kbuf_recycle(req, issue_flags);
+				return IOU_ISSUE_SKIP_COMPLETE;
+			}
+			return -EAGAIN;
+		}
 
 		if (ret > 0 && io_net_retry(sock, flags)) {
 			sr->len -= ret;
@@ -588,9 +657,13 @@  int io_send(struct io_kiocb *req, unsigned int issue_flags)
 		ret += sr->done_io;
 	else if (sr->done_io)
 		ret = sr->done_io;
-	cflags = io_put_kbuf(req, issue_flags);
-	io_req_set_res(req, ret, cflags);
-	return IOU_OK;
+	else
+		io_kbuf_recycle(req, issue_flags);
+
+	if (!io_send_finish(req, &ret, &msg, issue_flags))
+		goto retry_multishot;
+
+	return ret;
 }
 
 int io_recvmsg_prep_async(struct io_kiocb *req)
@@ -654,15 +727,6 @@  int io_recvmsg_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 	return 0;
 }
 
-static inline void io_recv_prep_retry(struct io_kiocb *req)
-{
-	struct io_sr_msg *sr = io_kiocb_to_cmd(req, struct io_sr_msg);
-
-	sr->done_io = 0;
-	sr->len = 0; /* get from the provided buffer */
-	req->buf_index = sr->buf_group;
-}
-
 /*
  * Finishes io_recv and io_recvmsg.
  *
@@ -697,7 +761,7 @@  static inline bool io_recv_finish(struct io_kiocb *req, int *ret,
 		struct io_sr_msg *sr = io_kiocb_to_cmd(req, struct io_sr_msg);
 		int mshot_retry_ret = IOU_ISSUE_SKIP_COMPLETE;
 
-		io_recv_prep_retry(req);
+		io_mshot_prep_retry(req);
 		/* Known not-empty or unknown state, retry */
 		if (cflags & IORING_CQE_F_SOCK_NONEMPTY || msg->msg_inq == -1) {
 			if (sr->nr_multishot_loops++ < MULTISHOT_MAX_RETRY)