[for-6.0,0/5] IORING_OP_SEND_ZC improvements

Message ID cover.1663363798.git.metze@samba.org

Message

Stefan Metzmacher Sept. 16, 2022, 9:36 p.m. UTC
Hi Pavel, hi Jens,

I did some initial testing with IORING_OP_SEND_ZC.
While reading the code I think I found a race that
can lead to IORING_CQE_F_MORE being missing even if
the net layer got references.

While there I added some code to allow userspace to
know how effective the IORING_OP_SEND_ZC attempt was,
in order to avoid it if zero copy isn't actually being
used (e.g. on a long-living tcp connection).

This change requires a change to the existing test, see:
https://github.com/metze-samba/liburing/tree/test-send-zerocopy
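
For context, the userspace pattern under discussion looks roughly like
the following minimal sketch (assuming liburing >= 2.2 and its
io_uring_prep_send_zc(); error handling trimmed). The request CQE may
carry IORING_CQE_F_MORE, which announces a second NOTIF CQE telling
userspace when the kernel is done with the buffer:

	#include <liburing.h>

	static int send_zc(struct io_uring *ring, int sockfd,
			   const void *buf, size_t len)
	{
		struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
		struct io_uring_cqe *cqe;
		int ret, more;

		io_uring_prep_send_zc(sqe, sockfd, buf, len, 0, 0);
		io_uring_submit(ring);

		io_uring_wait_cqe(ring, &cqe);
		ret = cqe->res;
		/* F_MORE announces a second (NOTIF) completion */
		more = cqe->flags & IORING_CQE_F_MORE;
		io_uring_cqe_seen(ring, cqe);

		if (more) {
			/* buf must stay stable until the NOTIF CQE */
			io_uring_wait_cqe(ring, &cqe);
			/* this one has IORING_CQE_F_NOTIF set */
			io_uring_cqe_seen(ring, cqe);
		}
		return ret;
	}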

Stefan Metzmacher (5):
  io_uring/opdef: rename SENDZC_NOTIF to SEND_ZC
  io_uring/core: move io_cqe->fd over from io_cqe->flags to io_cqe->res
  io_uring/core: keep req->cqe.flags on generic errors
  io_uring/net: let io_sendzc set IORING_CQE_F_MORE before
    sock_sendmsg()
  io_uring/notif: let userspace know how effective the zero copy usage
    was

 include/linux/io_uring_types.h |  6 +++---
 io_uring/io_uring.c            | 18 +++++++++++++-----
 io_uring/net.c                 | 19 +++++++++++++------
 io_uring/notif.c               | 18 ++++++++++++++++++
 io_uring/opdef.c               |  2 +-
 net/ipv4/ip_output.c           |  3 ++-
 net/ipv4/tcp.c                 |  2 ++
 net/ipv6/ip6_output.c          |  3 ++-
 8 files changed, 54 insertions(+), 17 deletions(-)

Comments

Pavel Begunkov Sept. 17, 2022, 9:16 a.m. UTC | #1
On 9/16/22 22:36, Stefan Metzmacher wrote:
> Hi Pavel, hi Jens,
> 
> I did some initial testing with IORING_OP_SEND_ZC.
> While reading the code I think I found a race that
> can lead to IORING_CQE_F_MORE being missing even if
> the net layer got references.

Hey Stefan,

Did you see some kind of buggy behaviour in userspace?
If the network sends anything it should return how many bytes
it queued for sending, otherwise there would be duplicated
packets / data on the other endpoint in userspace, and I
don't think any driver / lower layer would keep memory
after returning an error.
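
For illustration, a userspace consequence of that contract: a short
result from IORING_OP_SEND_ZC just means fewer bytes were queued, so
the caller retries the tail as with a short send(2). A hypothetical
sketch (send_zc() being any wrapper that returns the request CQE's res):

	static int send_zc_all(struct io_uring *ring, int sockfd,
			       const char *buf, size_t len)
	{
		size_t off = 0;

		while (off < len) {
			int res = send_zc(ring, sockfd, buf + off, len - off);

			if (res < 0)
				return res; /* hard error: nothing more queued */
			off += res;	    /* short send: resubmit only the tail */
		}
		return 0;
	}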

In any case, I was looking at a slightly different problem, but
it should look much cleaner using the same approach, see
branch [1], and patch [3] for sendzc in particular.

[1] https://github.com/isilence/linux.git partial-fail
[2] https://github.com/isilence/linux/tree/io_uring/partial-fail
[3] https://github.com/isilence/linux/commit/acb4f9bf869e1c2542849e11d992a63d95f2b894


> While there I added some code to allow userspace to
> know how effective the IORING_OP_SEND_ZC attempt was,
> in order to avoid it if zero copy isn't actually being
> used (e.g. on a long-living tcp connection).
> 
> This change requires a change to the existing test, see:
> https://github.com/metze-samba/liburing/tree/test-send-zerocopy
> 
> Stefan Metzmacher (5):
>    io_uring/opdef: rename SENDZC_NOTIF to SEND_ZC
>    io_uring/core: move io_cqe->fd over from io_cqe->flags to io_cqe->res
>    io_uring/core: keep req->cqe.flags on generic errors
>    io_uring/net: let io_sendzc set IORING_CQE_F_MORE before
>      sock_sendmsg()
>    io_uring/notif: let userspace know how effective the zero copy usage
>      was
> 
>   include/linux/io_uring_types.h |  6 +++---
>   io_uring/io_uring.c            | 18 +++++++++++++-----
>   io_uring/net.c                 | 19 +++++++++++++------
>   io_uring/notif.c               | 18 ++++++++++++++++++
>   io_uring/opdef.c               |  2 +-
>   net/ipv4/ip_output.c           |  3 ++-
>   net/ipv4/tcp.c                 |  2 ++
>   net/ipv6/ip6_output.c          |  3 ++-
>   8 files changed, 54 insertions(+), 17 deletions(-)
>
Stefan Metzmacher Sept. 17, 2022, 10:44 a.m. UTC | #2
Am 17.09.22 um 11:16 schrieb Pavel Begunkov:
> On 9/16/22 22:36, Stefan Metzmacher wrote:
>> Hi Pavel, hi Jens,
>>
>> I did some initial testing with IORING_OP_SEND_ZC.
>> While reading the code I think I found a race that
>> can lead to IORING_CQE_F_MORE being missing even if
>> the net layer got references.
> 
> Hey Stefan,
> 
> Did you see some kind of buggy behaviour in userspace?

No, I was just reading the code and found it a bit confusing,
and couldn't prove that we don't have a problem with losing
a notif cqe.

> If the network sends anything it should return how many bytes
> it queued for sending, otherwise there would be duplicated
> packets / data on the other endpoint in userspace, and I
> don't think any driver / lower layer would keep memory
> after returning an error.

As I'm also working on a socket driver for smbdirect,
I already thought about how I could hook into
IORING_OP_SEND[MSG]_ZC, and for sendmsg I'd have
a loop sending individual fragments, each holding a reference,
but if I find a connection drop after the first one, I'd
return ECONNRESET or EPIPE in order to get faster recovery
instead of announcing a short write to the caller.
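
Roughly the following hypothetical sketch (all smbd_*() names are
invented for illustration), where a connection drop mid-message is
reported as a hard error rather than a short write:

	static int smbd_sendmsg(struct socket *sock, struct msghdr *msg,
				size_t len)
	{
		size_t sent = 0;
		int ret;

		while (sent < len) {
			/* each fragment takes its own page references */
			ret = smbd_send_fragment(sock, msg, &sent);
			if (ret < 0) {
				/*
				 * Connection dropped mid-message: report the
				 * error even though earlier fragments went
				 * out, so the caller recovers fast instead
				 * of seeing an uncompletable short write.
				 */
				return sent ? -ECONNRESET : ret;
			}
		}
		return sent;
	}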

If we took my 5/5 we could also have a different
strategy to decide if MORE/NOTIF is needed.
If notif->cqe.res is still 0 and io_notif_flush drops
the last reference, we could go without MORE/NOTIF at all.
In all other cases we'd set MORE/NOTIF either at the end
of io_sendzc or in the fail hook.
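
A minimal sketch of that decision at the end of io_sendzc()
(io_notif_flush_last() is a hypothetical helper reporting whether
the flush dropped the final reference):

	if (notif->cqe.res == 0 && io_notif_flush_last(notif)) {
		/* the net layer never held the pages: plain CQE only */
		io_req_set_res(req, ret, 0);
	} else {
		/* references escaped: announce the upcoming NOTIF CQE */
		io_req_set_res(req, ret, IORING_CQE_F_MORE);
	}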

> In any case, I was looking on a bit different problem, but
> it should look much cleaner using the same approach, see
> branch [1], and patch [3] for sendzc in particular.
> 
> [1] https://github.com/isilence/linux.git partial-fail
> [2] https://github.com/isilence/linux/tree/io_uring/partial-fail
> [3] https://github.com/isilence/linux/commit/acb4f9bf869e1c2542849e11d992a63d95f2b894

	const struct io_op_def *def = &io_op_defs[req->opcode];

	req_set_fail(req);
	/* note: overwrites req->cqe.flags with just the kbuf flags */
	io_req_set_res(req, res, io_put_kbuf(req, IO_URING_F_UNLOCKED));
	if (def->fail)
		def->fail(req);
	io_req_complete_post(req);

This will lose req->cqe.flags, but the fail hook in general looks like a good idea.
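
A minimal tweak, if we wanted to keep them, would be to OR the
pre-set flags back in:

	io_req_set_res(req, res, req->cqe.flags |
		       io_put_kbuf(req, IO_URING_F_UNLOCKED));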

And don't we care about the other failure cases where req->cqe.flags gets overwritten?

metze
Jens Axboe Sept. 18, 2022, 10:49 p.m. UTC | #3
On Fri, 16 Sep 2022 23:36:24 +0200, Stefan Metzmacher wrote:
> I did some initial testing with IORING_OP_SEND_ZC.
> While reading the code I think I found a race that
> can lead to IORING_CQE_F_MORE being missing even if
> the net layer got references.
> 
> While there I added some code to allow userspace to
> know how effective the IORING_OP_SEND_ZC attempt was,
> in order to avoid it if zero copy isn't actually being
> used (e.g. on a long-living tcp connection).
> 
> [...]

Applied, thanks!

[1/5] io_uring/opdef: rename SENDZC_NOTIF to SEND_ZC
      commit: 9bd3f728223ebcfef8e9d087bdd142f0e388215d

Best regards,
Pavel Begunkov Sept. 21, 2022, 11:39 a.m. UTC | #4
On 9/17/22 11:44, Stefan Metzmacher wrote:
> Am 17.09.22 um 11:16 schrieb Pavel Begunkov:
>> On 9/16/22 22:36, Stefan Metzmacher wrote:
>>> Hi Pavel, hi Jens,
>>>
>>> I did some initial testing with IORING_OP_SEND_ZC.
>>> While reading the code I think I found a race that
>>> can lead to IORING_CQE_F_MORE being missing even if
>>> the net layer got references.
>>
>> Hey Stefan,
>>
>> Did you see some kind of buggy behaviour in userspace?

Apologies for the delay,

> No, I was just reading the code and found it a bit confusing,
> and couldn't prove that we don't have a problem with losing
> a notif cqe.
> 
>> If the network sends anything it should return how many bytes
>> it queued for sending, otherwise there would be duplicated
>> packets / data on the other endpoint in userspace, and I
>> don't think any driver / lower layer would keep memory
>> after returning an error.
> 
> As I'm also working on a socket driver for smbdirect,
> I already thought about how I could hook into
> IORING_OP_SEND[MSG]_ZC, and for sendmsg I'd have
> a loop sending individual fragments, each holding a reference,
> but if I find a connection drop after the first one, I'd
> return ECONNRESET or EPIPE in order to get faster recovery
> instead of announcing a short write to the caller.

It doesn't sound right to me, but I don't know samba well enough to
really have an opinion. In any case, I see how it may be
more robust if we always try to push a notification cqe.
Will you send a patch?

> If we took my 5/5 we could also have a different
> strategy to decide if MORE/NOTIF is needed.
> If notif->cqe.res is still 0 and io_notif_flush drops
> the last reference, we could go without MORE/NOTIF at all.
> In all other cases we'd set MORE/NOTIF either at the end
> of io_sendzc or in the fail hook.

I had a similar optimisation, i.e. when io_notif_flush() in
the submission path is dropping the last ref, but killed it
as it was completely useless: I haven't hit this path even
once, even with UDP, not to mention TCP.

>> In any case, I was looking at a slightly different problem, but
>> it should look much cleaner using the same approach, see
>> branch [1], and patch [3] for sendzc in particular.
>>
>> [1] https://github.com/isilence/linux.git partial-fail
>> [2] https://github.com/isilence/linux/tree/io_uring/partial-fail
>> [3] https://github.com/isilence/linux/commit/acb4f9bf869e1c2542849e11d992a63d95f2b894
> 
>      const struct io_op_def *def = &io_op_defs[req->opcode];
> 
>      req_set_fail(req);
>      io_req_set_res(req, res, io_put_kbuf(req, IO_URING_F_UNLOCKED));
>      if (def->fail)
>          def->fail(req);
>      io_req_complete_post(req);
> 
> This will lose req->cqe.flags, but the fail hook in general looks like a good idea.

I just don't like those sporadic changes all across core io_uring
code, which also add some overhead.

> And don't we care about the other failure cases where req->cqe.flags gets overwritten?

We don't usually carry them across ->issue handler boundaries,
e.g. directly do io_post_aux_cqe(res, IORING_CQE_F_MORE);

IORING_CQE_F_BUFFER is a bit trickier, but there is
special handling for this one and it wouldn't fit "set cflags
in advance" logic anyway.

iow, ->fail callback sounds good enough for now, we'll change
it later if needed.
Stefan Metzmacher Sept. 21, 2022, 12:18 p.m. UTC | #5
Hi Pavel,

>>> If the network sends anything it should return how many bytes
>>> it queued for sending, otherwise there would be duplicated
>>> packets / data on the other endpoint in userspace, and I
>>> don't think any driver / lower layer would keep memory
>>> after returning an error.
>>
>> As I'm also working on a socket driver for smbdirect,
>> I already thought about how I could hook into
>> IORING_OP_SEND[MSG]_ZC, and for sendmsg I'd have
>> a loop sending individual fragments, each holding a reference,
>> but if I find a connection drop after the first one, I'd
>> return ECONNRESET or EPIPE in order to get faster recovery
>> instead of announcing a short write to the caller.
> 
> It doesn't sound right to me, but I don't know samba well enough to
> really have an opinion. In any case, I see how it may be
> more robust if we always try to push a notification cqe.
> Will you send a patch?

You mean the IORING_OP_SEND_ZC should always
issue a NOTIF cqe, once it passed the io_sendzc_prep stage?

>> If we took my 5/5 we could also have a different
>> strategy to decide if MORE/NOTIF is needed.
>> If notif->cqe.res is still 0 and io_notif_flush drops
>> the last reference, we could go without MORE/NOTIF at all.
>> In all other cases we'd set MORE/NOTIF either at the end
>> of io_sendzc or in the fail hook.
> 
> I had a similar optimisation, i.e. when io_notif_flush() in
> the submission path is dropping the last ref, but killed it
> as it was completely useless: I haven't hit this path even
> once, even with UDP, not to mention TCP.

If I remember correctly I hit it all the time on loopback,
but I'd have to recheck.

>>> In any case, I was looking at a slightly different problem, but
>>> it should look much cleaner using the same approach, see
>>> branch [1], and patch [3] for sendzc in particular.
>>>
>>> [1] https://github.com/isilence/linux.git partial-fail
>>> [2] https://github.com/isilence/linux/tree/io_uring/partial-fail
>>> [3] https://github.com/isilence/linux/commit/acb4f9bf869e1c2542849e11d992a63d95f2b894
>>
>>      const struct io_op_def *def = &io_op_defs[req->opcode];
>>
>>      req_set_fail(req);
>>      io_req_set_res(req, res, io_put_kbuf(req, IO_URING_F_UNLOCKED));
>>      if (def->fail)
>>          def->fail(req);
>>      io_req_complete_post(req);
>>
>> This will lose req->cqe.flags, but the fail hook in general looks like a good idea.
> 
> I just don't like those sporadic changes all across core io_uring
> code, which also add some overhead.
> 
>> And don't we care about the other failure cases where req->cqe.flags gets overwritten?
> 
> We don't usually carry them across ->issue handler boundaries,
> e.g. directly do io_post_aux_cqe(res, IORING_CQE_F_MORE);
> 
> IORING_CQE_F_BUFFER is a bit trickier, but there is
> special handling for this one and it wouldn't fit "set cflags
> in advance" logic anyway.
> 
> iow, ->fail callback sounds good enough for now, we'll change
> it later if needed.

The fail hook should re-add the MORE flag?

So I'll try to do the following changes:

1. take your ->fail() patches

2. once io_sendzc_prep() is over always trigger MORE/NOTIF cqes
    (But the documentation should still make it clear that
     userspace has to cope with just a single cqe (without MORE)
     for both success and failure, so that we can improve things later)

3. Can I change the cqe.res of the NOTIF cqe to be 0xffffffff ?
    That would indicate to userspace that we don't give any information
    on whether zero copy was actually used or not. This would prevent
    someone from relying on cqe.res == 0 and we can improve it by providing
    more useful values in future.

Are you ok with that plan for 6.0?

metze
Pavel Begunkov Sept. 21, 2022, 12:58 p.m. UTC | #6
On 9/21/22 13:18, Stefan Metzmacher wrote:
> 
> Hi Pavel,
> 
>>>> If the network sends anything it should return how many bytes
>>>> it queued for sending, otherwise there would be duplicated
>>>> packets / data on the other endpoint in userspace, and I
>>>> don't think any driver / lower layer would keep memory
>>>> after returning an error.
>>>
>>> As I'm also working on a socket driver for smbdirect,
>>> I already thought about how I could hook into
>>> IORING_OP_SEND[MSG]_ZC, and for sendmsg I'd have
>>> a loop sending individual fragments, each holding a reference,
>>> but if I find a connection drop after the first one, I'd
>>> return ECONNRESET or EPIPE in order to get faster recovery
>>> instead of announcing a short write to the caller.
>>
>> It doesn't sound right to me, but I don't know samba well enough to
>> really have an opinion. In any case, I see how it may be
>> more robust if we always try to push a notification cqe.
>> Will you send a patch?
> 
> You mean the IORING_OP_SEND_ZC should always
> issue a NOTIF cqe, once it passed the io_sendzc_prep stage?

Currently we add F_MORE iff cqe->res >= 0, but I want to
decouple them so users don't try to infer one from the other.

In theory, it's already a subset of this, but it sounds
like a good idea to always emit notifications (when we can)
to stop users making assumptions about it. And it should also
be cleaner.

>>> If we took my 5/5 we could also have a different
>>> strategy to decide if MORE/NOTIF is needed.
>>> If notif->cqe.res is still 0 and io_notif_flush drops
>>> the last reference, we could go without MORE/NOTIF at all.
>>> In all other cases we'd set MORE/NOTIF either at the end
>>> of io_sendzc or in the fail hook.
>>
>> I had a similar optimisation, i.e. when io_notif_flush() in
>> the submission path is dropping the last ref, but killed it
>> as it was completely useless: I haven't hit this path even
>> once, even with UDP, not to mention TCP.
> 
> If I remember correctly I hit it all the time on loopback,
> but I'd have to recheck.

Yeah, I meant real network in particular. I've seen it with
virtual interfaces, but for instance loopback is not interesting
as it doesn't support zerocopy in the first place. You were
probably testing with a modified skb_orphan_frags_rx().

>>>> In any case, I was looking at a slightly different problem, but
>>>> it should look much cleaner using the same approach, see
>>>> branch [1], and patch [3] for sendzc in particular.
>>>>
>>>> [1] https://github.com/isilence/linux.git partial-fail
>>>> [2] https://github.com/isilence/linux/tree/io_uring/partial-fail
>>>> [3] https://github.com/isilence/linux/commit/acb4f9bf869e1c2542849e11d992a63d95f2b894
>>>
>>>      const struct io_op_def *def = &io_op_defs[req->opcode];
>>>
>>>      req_set_fail(req);
>>>      io_req_set_res(req, res, io_put_kbuf(req, IO_URING_F_UNLOCKED));
>>>      if (def->fail)
>>>          def->fail(req);
>>>      io_req_complete_post(req);
>>>
>>> This will lose req->cqe.flags, but the fail hook in general looks like a good idea.
>>
>> I just don't like those sporadic changes all across core io_uring
>> code, which also add some overhead.
>>
>>> And don't we care about the other failure cases where req->cqe.flags gets overwritten?
>>
>> We don't usually carry them across ->issue handler boundaries,
>> e.g. directly do io_post_aux_cqe(res, IORING_CQE_F_MORE);
>>
>> IORING_CQE_F_BUFFER is a bit trickier, but there is
>> special handling for this one and it wouldn't fit "set cflags
>> in advance" logic anyway.
>>
>> iow, ->fail callback sounds good enough for now, we'll change
>> it later if needed.
> 
> The fail hook should re-add the MORE flag?

Never setting CQE_SKIP for notifications and adding the MORE flag
in the fail hook sounds good.
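
For illustration, such a ->fail() hook could look roughly like the
following sketch (field and flag names are approximate; it assumes
the ->fail() callback from the branch above):

	static void io_sendzc_fail(struct io_kiocb *req)
	{
		struct io_sendzc *zc = io_kiocb_to_cmd(req);

		if (req->flags & REQ_F_PARTIAL_IO)
			req->cqe.res = zc->done_io; /* report what was queued */
		if (zc->notif)
			/* a NOTIF CQE will still follow: re-announce it */
			req->cqe.flags |= IORING_CQE_F_MORE;
	}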


> So I'll try to do the following changes:
> 
> 1. take your ->fail() patches
> 
> 2. once io_sendzc_prep() is over always trigger MORE/NOTIF cqes
>     (But the documentation should still make it clear that
>      userspace has to cope with just a single cqe (without MORE)
>      for both success and failure, so that we can improve things later)

I've sent a liburing man patch, should be good enough.

> 3. Can I change the cqe.res of the NOTIF cqe to be 0xffffffff ?
>     That would indicate to userspace that we don't give any information
>     on whether zero copy was actually used or not. This would prevent
>     someone from relying on cqe.res == 0 and we can improve it by providing
>     more useful values in future.

I don't get the difference, someone may just as easily rely on
0xf..ff. What does it give us? 0 usually suits better as a default.

> Are you ok with that plan for 6.0?

It's late, so 1-2 are better off going into 6.1 + stable. It doesn't
break uapi, so it's fine. It's not even strictly necessary to backport
them (but it's still a good idea to do it).