
[8/8] io_uring/net: set MSG_MORE if we're doing multishot send and have more

Message ID 20240225003941.129030-9-axboe@kernel.dk (mailing list archive)
State New
Series Support for provided buffers for send

Commit Message

Jens Axboe Feb. 25, 2024, 12:35 a.m. UTC
If we have more data pending, we know we're going to do one more loop.
If that's the case, then set MSG_MORE to inform the networking stack
that there's more data coming shortly for this socket.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 io_uring/net.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)
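
For illustration, a minimal userspace sketch of the semantics MSG_MORE
requests from the stack (hypothetical connected TCP socket and
payloads, not part of the patch):

#include <string.h>
#include <sys/socket.h>

/* fd is assumed to be a connected TCP socket */
static void send_in_two_parts(int fd)
{
    const char hdr[]  = "header-";
    const char body[] = "payload\n";

    /* more data follows shortly: let the stack coalesce */
    send(fd, hdr, strlen(hdr), MSG_MORE);

    /* final piece without MSG_MORE: the stack may transmit now */
    send(fd, body, strlen(body), 0);
}

The patch applies the same hint automatically on each multishot
iteration for which another provided buffer is known to be queued.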

Comments

Dylan Yudaken Feb. 26, 2024, 10:59 a.m. UTC | #1
On Sun, Feb 25, 2024 at 12:46 AM Jens Axboe <axboe@kernel.dk> wrote:
>
> If we have more data pending, we know we're going to do one more loop.
> If that's the case, then set MSG_MORE to inform the networking stack
> that there's more data coming shortly for this socket.
>
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
>  io_uring/net.c | 10 +++++++---
>  1 file changed, 7 insertions(+), 3 deletions(-)
>
> diff --git a/io_uring/net.c b/io_uring/net.c
> index 240b8eff1a78..07307dd5a077 100644
> --- a/io_uring/net.c
> +++ b/io_uring/net.c
> @@ -519,6 +519,10 @@ int io_sendmsg(struct io_kiocb *req, unsigned int issue_flags)
>         if (!io_check_multishot(req, issue_flags))
>                 return io_setup_async_msg(req, kmsg, issue_flags);
>
> +       flags = sr->msg_flags;
> +       if (issue_flags & IO_URING_F_NONBLOCK)
> +               flags |= MSG_DONTWAIT;
> +
>  retry_multishot:
>         if (io_do_buffer_select(req)) {
>                 void __user *buf;
> @@ -528,12 +532,12 @@ int io_sendmsg(struct io_kiocb *req, unsigned int issue_flags)
>                 if (!buf)
>                         return -ENOBUFS;
>
> +               if ((req->flags & (REQ_F_BL_EMPTY|REQ_F_APOLL_MULTISHOT)) ==
> +                                  REQ_F_APOLL_MULTISHOT)
> +                       flags |= MSG_MORE;
>                 iov_iter_ubuf(&kmsg->msg.msg_iter, ITER_SOURCE, buf, len);
>         }

This feels racy. I don't have an exact sequence in mind, but I believe
there are cases where, between the two calls to __sys_sendmsg_sock,
another submission could be issued and drain the buffer list. I guess
the result would be that the packet is never sent out, but I have not
followed the codepaths of MSG_MORE.

The obvious other way to trigger this codepath is if the user messes
with the ring by decrementing the buffer counter. I do not believe
there are any nefarious outcomes - but just to point out that
REQ_F_BL_EMPTY is essentially user controlled.
Jens Axboe Feb. 26, 2024, 1:42 p.m. UTC | #2
On 2/26/24 3:59 AM, Dylan Yudaken wrote:
> On Sun, Feb 25, 2024 at 12:46 AM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> If we have more data pending, we know we're going to do one more loop.
>> If that's the case, then set MSG_MORE to inform the networking stack
>> that there's more data coming shortly for this socket.
>>
>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>> ---
>>  io_uring/net.c | 10 +++++++---
>>  1 file changed, 7 insertions(+), 3 deletions(-)
>>
>> diff --git a/io_uring/net.c b/io_uring/net.c
>> index 240b8eff1a78..07307dd5a077 100644
>> --- a/io_uring/net.c
>> +++ b/io_uring/net.c
>> @@ -519,6 +519,10 @@ int io_sendmsg(struct io_kiocb *req, unsigned int issue_flags)
>>         if (!io_check_multishot(req, issue_flags))
>>                 return io_setup_async_msg(req, kmsg, issue_flags);
>>
>> +       flags = sr->msg_flags;
>> +       if (issue_flags & IO_URING_F_NONBLOCK)
>> +               flags |= MSG_DONTWAIT;
>> +
>>  retry_multishot:
>>         if (io_do_buffer_select(req)) {
>>                 void __user *buf;
>> @@ -528,12 +532,12 @@ int io_sendmsg(struct io_kiocb *req, unsigned int issue_flags)
>>                 if (!buf)
>>                         return -ENOBUFS;
>>
>> +               if ((req->flags & (REQ_F_BL_EMPTY|REQ_F_APOLL_MULTISHOT)) ==
>> +                                  REQ_F_APOLL_MULTISHOT)
>> +                       flags |= MSG_MORE;
>>                 iov_iter_ubuf(&kmsg->msg.msg_iter, ITER_SOURCE, buf, len);
>>         }
> 
> This feels racy. I don't have an exact sequence in mind, but I believe
> there are cases where, between the two calls to __sys_sendmsg_sock,
> another submission could be issued and drain the buffer list. I guess
> the result would be that the packet is never sent out, but I have not
> followed the codepaths of MSG_MORE.

This is true, but that race always exists depending on who gets to go
first (the adding of the buffer, or the send itself). The way I see it,
when the send is issued we're making the guarantee that we're going to
at least deplete the queue as it looks when entered. If more is added
while it's being processed, we _may_ see it.

Outside of that, we don't want it to potentially run in perpetuity. It
may actually be a good idea to make the rule of "just issue what was
there when first seen/issued" a hard one, though I don't think it's
really worth doing. But making any guarantees on buffers added in
parallel will be impossible. If you do that, then you have to deal with
figuring out what's left in the queue once you get a completion without
CQE_F_MORE.

> The obvious other way to trigger this codepath is if the user messes
> with the ring by decrementing the buffer counter. I do not believe
> there are any nefarious outcomes - but just to point out that
> REQ_F_BL_EMPTY is essentially user controlled.

The user may certainly shoot himself in the foot. As long as that
doesn't lead to a nefarious outcome, then that's not a concern. For this
case, the head is kernel local, user can only write to the tail. So we
could have a case of user fiddling with the tail and when we grab the
next buffer (and the previous one did not have REQ_F_BL_EMPTY set), the
ring will indeed appear to be empty. At that point you get an -ENOBUFS
without CQE_F_MORE set.
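
A simplified sketch of that head/tail split (illustrative types only,
not the kernel's actual buffer ring structures):

struct buf_ring_view {
    unsigned short head;    /* kernel-local consumer index */
    unsigned short tail;    /* published by userspace */
};

/*
 * The ring looks empty once the kernel-local head catches up with
 * the user-visible tail. Rewinding the tail from userspace can only
 * make the ring appear empty early; per the above, the outcome is
 * -ENOBUFS without CQE_F_MORE, nothing nefarious.
 */
static int ring_looks_empty(const struct buf_ring_view *br)
{
    return br->head == br->tail;
}
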
Pavel Begunkov Feb. 26, 2024, 2:24 p.m. UTC | #3
On 2/26/24 13:42, Jens Axboe wrote:
> On 2/26/24 3:59 AM, Dylan Yudaken wrote:
>> On Sun, Feb 25, 2024 at 12:46 AM Jens Axboe <axboe@kernel.dk> wrote:
>>>
>>> If we have more data pending, we know we're going to do one more loop.
>>> If that's the case, then set MSG_MORE to inform the networking stack
>>> that there's more data coming shortly for this socket.
>>>
>>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>>> ---
>>>   io_uring/net.c | 10 +++++++---
>>>   1 file changed, 7 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/io_uring/net.c b/io_uring/net.c
>>> index 240b8eff1a78..07307dd5a077 100644
>>> --- a/io_uring/net.c
>>> +++ b/io_uring/net.c
>>> @@ -519,6 +519,10 @@ int io_sendmsg(struct io_kiocb *req, unsigned int issue_flags)
>>>          if (!io_check_multishot(req, issue_flags))
>>>                  return io_setup_async_msg(req, kmsg, issue_flags);
>>>
>>> +       flags = sr->msg_flags;
>>> +       if (issue_flags & IO_URING_F_NONBLOCK)
>>> +               flags |= MSG_DONTWAIT;
>>> +
>>>   retry_multishot:
>>>          if (io_do_buffer_select(req)) {
>>>                  void __user *buf;
>>> @@ -528,12 +532,12 @@ int io_sendmsg(struct io_kiocb *req, unsigned int issue_flags)
>>>                  if (!buf)
>>>                          return -ENOBUFS;
>>>
>>> +               if ((req->flags & (REQ_F_BL_EMPTY|REQ_F_APOLL_MULTISHOT)) ==
>>> +                                  REQ_F_APOLL_MULTISHOT)
>>> +                       flags |= MSG_MORE;
>>>                  iov_iter_ubuf(&kmsg->msg.msg_iter, ITER_SOURCE, buf, len);
>>>          }
>>
>> This feels racy. I don't have an exact sequence in mind, but I believe
>> there are cases where, between the two calls to __sys_sendmsg_sock,
>> another submission could be issued and drain the buffer list. I guess
>> the result would be that the packet is never sent out, but I have not
>> followed the codepaths of MSG_MORE.
> 
> This is true, but that race always exists depending on who gets to go
> first (the adding of the buffer, or the send itself). The way I see it,
> when the send is issued we're making the guarantee that we're going to
> at least deplete the queue as it looks when entered. If more is added
> while it's being processed, we _may_ see it.
> 
> Outside of that, we don't want it to potentially run in perpetuity. It
> may actually be a good idea to make the rule of "just issue what was
> there when first seen/issued" a hard one, though I don't think it's
> really worth doing. But making any guarantees on buffers added in
> parallel will be impossible. If you do that, then you have to deal with
> figuring out what's left in the queue once you get a completion without
> CQE_F_MORE.
> 
>> The obvious other way to trigger this codepath is if the user messes
>> with the ring by decrementing the buffer counter. I do not believe
>> there are any nefarious outcomes - but just to point out that
>> REQ_F_BL_EMPTY is essentially user controlled.
> 
> The user may certainly shoot himself in the foot. As long as that
> doesn't lead to a nefarious outcome, then that's not a concern. For this
> case, the head is kernel local, user can only write to the tail. So we
> could have a case of user fiddling with the tail and when we grab the
> next buffer (and the previous one did not have REQ_F_BL_EMPTY set), the
> ring will indeed appear to be empty. At that point you get an -ENOBUFS
> without CQE_F_MORE set.

A side note, don't forget that there are other protocols apart
from TCP. AFAIK UDP corking will pack it into a single datagram,
which is not the same as two separate sends.
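
For instance (hypothetical connected UDP socket, a sketch of the
corking behaviour described above):

#include <string.h>
#include <sys/socket.h>

/* fd is assumed to be a connected UDP socket */
static void udp_msg_more(int fd)
{
    /* corked: nothing is transmitted yet */
    send(fd, "first", strlen("first"), MSG_MORE);

    /* uncorks: one 11-byte datagram, where two sends without
     * MSG_MORE would have produced two separate datagrams */
    send(fd, "second", strlen("second"), 0);
}
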
Jens Axboe Feb. 26, 2024, 2:52 p.m. UTC | #4
On 2/26/24 7:24 AM, Pavel Begunkov wrote:
> On 2/26/24 13:42, Jens Axboe wrote:
>> On 2/26/24 3:59 AM, Dylan Yudaken wrote:
>>> On Sun, Feb 25, 2024 at 12:46 AM Jens Axboe <axboe@kernel.dk> wrote:
>>>>
>>>> If we have more data pending, we know we're going to do one more loop.
>>>> If that's the case, then set MSG_MORE to inform the networking stack
>>>> that there's more data coming shortly for this socket.
>>>>
>>>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>>>> ---
>>>>   io_uring/net.c | 10 +++++++---
>>>>   1 file changed, 7 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/io_uring/net.c b/io_uring/net.c
>>>> index 240b8eff1a78..07307dd5a077 100644
>>>> --- a/io_uring/net.c
>>>> +++ b/io_uring/net.c
>>>> @@ -519,6 +519,10 @@ int io_sendmsg(struct io_kiocb *req, unsigned int issue_flags)
>>>>          if (!io_check_multishot(req, issue_flags))
>>>>                  return io_setup_async_msg(req, kmsg, issue_flags);
>>>>
>>>> +       flags = sr->msg_flags;
>>>> +       if (issue_flags & IO_URING_F_NONBLOCK)
>>>> +               flags |= MSG_DONTWAIT;
>>>> +
>>>>   retry_multishot:
>>>>          if (io_do_buffer_select(req)) {
>>>>                  void __user *buf;
>>>> @@ -528,12 +532,12 @@ int io_sendmsg(struct io_kiocb *req, unsigned int issue_flags)
>>>>                  if (!buf)
>>>>                          return -ENOBUFS;
>>>>
>>>> +               if ((req->flags & (REQ_F_BL_EMPTY|REQ_F_APOLL_MULTISHOT)) ==
>>>> +                                  REQ_F_APOLL_MULTISHOT)
>>>> +                       flags |= MSG_MORE;
>>>>                  iov_iter_ubuf(&kmsg->msg.msg_iter, ITER_SOURCE, buf, len);
>>>>          }
>>>
>>> This feels racy. I don't have an exact sequence in mind, but I believe
>>> there are cases where, between the two calls to __sys_sendmsg_sock,
>>> another submission could be issued and drain the buffer list. I guess
>>> the result would be that the packet is never sent out, but I have not
>>> followed the codepaths of MSG_MORE.
>>
>> This is true, but that race always exists depending on who gets to go
>> first (the adding of the buffer, or the send itself). The way I see it,
>> when the send is issued we're making the guarantee that we're going to
>> at least deplete the queue as it looks when entered. If more is added
>> while it's being processed, we _may_ see it.
>>
>> Outside of that, we don't want it to potentially run in perpetuity. It
>> may actually be a good idea to make the rule of "just issue what was
>> there when first seen/issued" a hard one, though I don't think it's
>> really worth doing. But making any guarantees on buffers added in
>> parallel will be impossible. If you do that, then you have to deal with
>> figuring out what's left in the queue once you get a completion without
>> CQE_F_MORE.
>>
>>> The obvious other way to trigger this codepath is if the user messes
>>> with the ring by decrementing the buffer counter. I do not believe
>>> there are any nefarious outcomes - but just to point out that
>>> REQ_F_BL_EMPTY is essentially user controlled.
>>
>> The user may certainly shoot himself in the foot. As long as that
>> doesn't lead to a nefarious outcome, then that's not a concern. For this
>> case, the head is kernel local, user can only write to the tail. So we
>> could have a case of user fiddling with the tail and when we grab the
>> next buffer (and the previous one did not have REQ_F_BL_EMPTY set), the
>> ring will indeed appear to be empty. At that point you get an -ENOBUFS
>> without CQE_F_MORE set.
> 
> A side note, don't forget that there are other protocols apart
> from TCP. AFAIK UDP corking will pack it into a single datagram,
> which is not the same as two separate sends.

Yeah, should really have labeled this one as a test/RFC kind of patch.
I wasn't even convinced we want to do this unconditionally for TCP.
I'll just leave it at the end for now; it's a separate discussion imho,
which is why it was left as a separate patch rather than being bundled
with the multishot send in general.

Patch

diff --git a/io_uring/net.c b/io_uring/net.c
index 240b8eff1a78..07307dd5a077 100644
--- a/io_uring/net.c
+++ b/io_uring/net.c
@@ -519,6 +519,10 @@ int io_sendmsg(struct io_kiocb *req, unsigned int issue_flags)
 	if (!io_check_multishot(req, issue_flags))
 		return io_setup_async_msg(req, kmsg, issue_flags);
 
+	flags = sr->msg_flags;
+	if (issue_flags & IO_URING_F_NONBLOCK)
+		flags |= MSG_DONTWAIT;
+
 retry_multishot:
 	if (io_do_buffer_select(req)) {
 		void __user *buf;
@@ -528,12 +532,12 @@ int io_sendmsg(struct io_kiocb *req, unsigned int issue_flags)
 		if (!buf)
 			return -ENOBUFS;
 
+		if ((req->flags & (REQ_F_BL_EMPTY|REQ_F_APOLL_MULTISHOT)) ==
+				   REQ_F_APOLL_MULTISHOT)
+			flags |= MSG_MORE;
 		iov_iter_ubuf(&kmsg->msg.msg_iter, ITER_SOURCE, buf, len);
 	}
 
-	flags = sr->msg_flags;
-	if (issue_flags & IO_URING_F_NONBLOCK)
-		flags |= MSG_DONTWAIT;
 	if (flags & MSG_WAITALL)
 		min_ret = iov_iter_count(&kmsg->msg.msg_iter);
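
A standalone illustration of the flag test this hunk adds (placeholder
bit values, not the kernel's): MSG_MORE is requested only when the
request is a multishot send and this buffer pick did not empty the
provided buffer list.

#include <stdbool.h>
#include <stdio.h>

#define REQ_F_BL_EMPTY        (1u << 0) /* placeholder bit values */
#define REQ_F_APOLL_MULTISHOT (1u << 1)

/* (flags & (A|B)) == B  <=>  B is set and A is clear */
static bool want_msg_more(unsigned int req_flags)
{
    return (req_flags & (REQ_F_BL_EMPTY | REQ_F_APOLL_MULTISHOT)) ==
            REQ_F_APOLL_MULTISHOT;
}

int main(void)
{
    /* multishot with buffers remaining: set MSG_MORE */
    printf("%d\n", want_msg_more(REQ_F_APOLL_MULTISHOT));
    /* multishot but last buffer consumed: do not set it */
    printf("%d\n", want_msg_more(REQ_F_APOLL_MULTISHOT | REQ_F_BL_EMPTY));
    /* single-shot send: do not set it */
    printf("%d\n", want_msg_more(0));
    return 0;
}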