mbox series

[RFC,00/12] io_uring zerocopy send

Message ID cover.1638282789.git.asml.silence@gmail.com (mailing list archive)
Headers show
Series io_uring zerocopy send | expand

Message

Pavel Begunkov Nov. 30, 2021, 3:18 p.m. UTC
Early proof of concept for zerocopy send via io_uring. This is just
an RFC, there are details yet to be figured out, but hope to gather
some feedback.

Benchmarking udp (65435 bytes) with a dummy net device (mtu=0xffff):
The best case io_uring=116079 MB/s vs msg_zerocopy=47421 MB/s,
or 2.44 times faster.

№ | test:                                | BW (MB/s)  | speedup
1 | msg_zerocopy (non-zc)                |  18281     | 0.38
2 | msg_zerocopy -z (baseline)           |  47421     | 1
3 | io_uring (@flush=false, nr_reqs=1)   |  96534     | 2.03
4 | io_uring (@flush=true,  nr_reqs=1)   |  89310     | 1.88
5 | io_uring (@flush=false, nr_reqs=8)   | 116079     | 2.44
6 | io_uring (@flush=true,  nr_reqs=8)   | 109722     | 2.31

Based on selftests/.../msg_zerocopy but more limited. You can use
msg_zerocopy -r as usual for receive side.

@nr_reqs controls how many io_uring requests we submit per syscall, e.g.
nr_reqs=1 avoids any io_uring batching, which is wasteful.
@flush controls whether to generate "buffer not used" event per request
or not at all. The API and implementation is more flexible, e.g. allows
to have one notification per N requests, but not added to the benchmark.

It's ipv4/udp only for now. The concept works well with tcp as well, but
omitted patches for now.

# design

The userspace design is briefly outlined in the message to 06/12. In
short, it allows to attach several io_uring send requests to one of
pre-registered notification contexts, and the userspace can "flush"
a notification from that context making it to post an event into
the io_uring's completion queue when buffers are no more in use.

From the net perspective, as ubuf_info's are controlled by io_uring, we
need a way to pass it from outside, which is currently done through a
new struct msghdr::msg_ubuf field.

Note: one great aspect ubufs being controlled from outside is that
it doesn't matter whether the net actually uses the ubuf or not, the
userspace gets the same API in terms of events delivery: it'll get
that additional event about buffers even it wasn't zerocopy.

Another big part is 5/12, which allows to skip get/put_page() for
io_uring's fixed buffers. io_uring ensures that the pages stay alive
until the corresponding ubuf_info is released.

# performance:

The worst case for io_uring is (4), still 1.88 times faster than
msg_zerocopy (2), and there are a couple of "easy" optimisations left
out from the patchset. For 4096 bytes payload zc is only slightly
outperforms non-zc version, the larger payload the wider gap.
I'll get more numbers next time.

Comparing (3) and (4), and (5) vs (6), @flush doesn't affect it too
much. Notification posting is not a big problem for now, but need
to compare the performance for when io_uring_tx_zerocopy_callback()
is called from IRQ context, and possible rework it to use task_work.

It supports both, regular buffers and fixed ones, but there is a bunch of
optimisations exclusively for io_uring's fixed buffers. For comparison,
normal vs fixed buffers (@nr_reqs=8, @flush=0): 75677 vs 116079 MB/s

1) we pass a bvec, so no page table walks.
2) zerocopy_sg_from_iter() is just slow, adding a bvec optimised version
   still doing page get/put (see 4/12) slashed 4-5%.
3) avoiding get_page/put_page in 5/12
4) completion events are posted into io_uring's CQ, so no
   extra recvmsg for getting events
5) no poll(2) in the code because of io_uring
6) lot of time is spent in sock_omalloc()/free allocating ubuf_info.
   io_uring caches the structures reducing it to nearly zero-overhead.

io_uring adds some overhead but there are also benefits of using it,
e.g. fixed files and submission batching (@nr_reqs). We can look at
@nr_reqs=1 test cases for more of a "net-only" optimisations.

# discussion / questions

I haven't got a grasp on many aspects of the net stack yet, so would
appreciate feedback in general and there are a couple of questions
thoughts.

1) What are initialisation rules for adding a new field into
struct mshdr? E.g. many users (mainly LLD) hand code initialisation not
filling all the fields.

2) I don't like too much ubuf_info propagation from udp_sendmsg() into
__ip_append_data() (see 3/12). Ideas how to do it better?

3) 5/12 is a quick'n'dirty patch for letting upper layer to manage page
references but likely needs more work. One problem is to prevent mixing
such pages managed by upper layer and normal referenced ones, that's
what SKBFL_MANAGED_FRAGS is about and many chunk trying to make the
accounting of it right. Any pitfalls I missed? Would really love some
ideas and advice here, e.g. how to make it cleaner and/or remove
overhead from non-zc path.

# references

Kernel git branch for convenience:
https://github.com/isilence/linux.git zc_v1

Benchmark:
https://github.com/isilence/liburing.git zc_v1

or this file in particular:
https://github.com/isilence/liburing/blob/zc_v1/test/send-zc.c

To run the benchmark:
```
cd <liburing_dir> && make && cd test
# ./send-zc -4 [-p <port>] [-s <payload_size>] -D <destination> udp
./send-zc -4 -D 127.0.0.1 udp
```

msg_zerocopy can be used for the server side, e.g.
```
cd <linux-kernel>/tools/testing/selftests/net && make
./msg_zerocopy -4 -r [-p <port>] [-t <sec>] udp
```

Pavel Begunkov (12):
  skbuff: add SKBFL_DONT_ORPHAN flag
  skbuff: pass a struct ubuf_info in msghdr
  net/udp: add support msgdr::msg_ubuf
  net: add zerocopy_sg_from_iter for bvec
  net: optimise page get/free for bvec zc
  io_uring: add send notifiers registration
  io_uring: infrastructure for send zc notifications
  io_uring: wire send zc request type
  io_uring: add an option to flush zc notifications
  io_uring: opcode independent fixed buf import
  io_uring: sendzc with fixed buffers
  io_uring: cache struct ubuf_info

 fs/io_uring.c                 | 397 +++++++++++++++++++++++++++++++++-
 include/linux/skbuff.h        |  15 +-
 include/linux/socket.h        |   1 +
 include/net/ip.h              |   3 +-
 include/uapi/linux/io_uring.h |  14 ++
 net/compat.c                  |   1 +
 net/core/datagram.c           |  59 +++++
 net/core/skbuff.c             |  18 +-
 net/ipv4/ip_output.c          |  35 ++-
 net/ipv4/udp.c                |   2 +-
 net/socket.c                  |   3 +
 11 files changed, 521 insertions(+), 27 deletions(-)

Comments

David Ahern Dec. 1, 2021, 3:10 a.m. UTC | #1
On 11/30/21 8:18 AM, Pavel Begunkov wrote:
> Early proof of concept for zerocopy send via io_uring. This is just
> an RFC, there are details yet to be figured out, but hope to gather
> some feedback.
> 
> Benchmarking udp (65435 bytes) with a dummy net device (mtu=0xffff):
> The best case io_uring=116079 MB/s vs msg_zerocopy=47421 MB/s,
> or 2.44 times faster.
> 
> № | test:                                | BW (MB/s)  | speedup
> 1 | msg_zerocopy (non-zc)                |  18281     | 0.38
> 2 | msg_zerocopy -z (baseline)           |  47421     | 1
> 3 | io_uring (@flush=false, nr_reqs=1)   |  96534     | 2.03
> 4 | io_uring (@flush=true,  nr_reqs=1)   |  89310     | 1.88
> 5 | io_uring (@flush=false, nr_reqs=8)   | 116079     | 2.44
> 6 | io_uring (@flush=true,  nr_reqs=8)   | 109722     | 2.31
> 
> Based on selftests/.../msg_zerocopy but more limited. You can use
> msg_zerocopy -r as usual for receive side.
> 
...

Can you state the exact command lines you are running for all of the
commands? I tried this set (and commands referenced below) and my
mileage varies quite a bit.

Also, have you run this proposed change (and with TCP) across nodes
(ie., not just local process to local process via dummy interface)?

> Benchmark:
> https://github.com/isilence/liburing.git zc_v1
> 
> or this file in particular:
> https://github.com/isilence/liburing/blob/zc_v1/test/send-zc.c
> 
> To run the benchmark:
> ```
> cd <liburing_dir> && make && cd test
> # ./send-zc -4 [-p <port>] [-s <payload_size>] -D <destination> udp
> ./send-zc -4 -D 127.0.0.1 udp
> ```
> 
> msg_zerocopy can be used for the server side, e.g.
> ```
> cd <linux-kernel>/tools/testing/selftests/net && make
> ./msg_zerocopy -4 -r [-p <port>] [-t <sec>] udp
> ```
>
Pavel Begunkov Dec. 1, 2021, 2:31 p.m. UTC | #2
On 11/30/21 15:18, Pavel Begunkov wrote:
> Early proof of concept for zerocopy send via io_uring. This is just
> an RFC, there are details yet to be figured out, but hope to gather
> some feedback.

For larger audience, registered buffers is an io_uring feature
where it pins in advance the userspace memory and keeps it as
an array of pages (bvec in particular).
There is extra logic to make sure the pages stay referenced when
there are inflight requests using it, so they can avoid grabbing
extra page references.

Also, as was asked, attaching a standalone .c version of the
benchmark. Requires any relatively up-to-date liburing installed.

gcc -luring -O2 -o send-zc ./send-zc.c

There are also a couple of options added:

./send-zc [options] [-r <nr_submit_batching>] [-f] udp

-r <nr_submit_batch> is @nr_reqs from the cover-letter, 8 by default,
-f sets @flush to true, false by default.


> Benchmarking udp (65435 bytes) with a dummy net device (mtu=0xffff):
> The best case io_uring=116079 MB/s vs msg_zerocopy=47421 MB/s,
> or 2.44 times faster.
> 
> № | test:                                | BW (MB/s)  | speedup
> 1 | msg_zerocopy (non-zc)                |  18281     | 0.38
> 2 | msg_zerocopy -z (baseline)           |  47421     | 1
> 3 | io_uring (@flush=false, nr_reqs=1)   |  96534     | 2.03
> 4 | io_uring (@flush=true,  nr_reqs=1)   |  89310     | 1.88
> 5 | io_uring (@flush=false, nr_reqs=8)   | 116079     | 2.44
> 6 | io_uring (@flush=true,  nr_reqs=8)   | 109722     | 2.31
> 
> Based on selftests/.../msg_zerocopy but more limited. You can use
> msg_zerocopy -r as usual for receive side.
> 
[...]
Pavel Begunkov Dec. 1, 2021, 3:32 p.m. UTC | #3
On 12/1/21 03:10, David Ahern wrote:
> On 11/30/21 8:18 AM, Pavel Begunkov wrote:
>> Early proof of concept for zerocopy send via io_uring. This is just
>> an RFC, there are details yet to be figured out, but hope to gather
>> some feedback.
>>
>> Benchmarking udp (65435 bytes) with a dummy net device (mtu=0xffff):
>> The best case io_uring=116079 MB/s vs msg_zerocopy=47421 MB/s,
>> or 2.44 times faster.
>>
>> № | test:                                | BW (MB/s)  | speedup
>> 1 | msg_zerocopy (non-zc)                |  18281     | 0.38
>> 2 | msg_zerocopy -z (baseline)           |  47421     | 1
>> 3 | io_uring (@flush=false, nr_reqs=1)   |  96534     | 2.03
>> 4 | io_uring (@flush=true,  nr_reqs=1)   |  89310     | 1.88
>> 5 | io_uring (@flush=false, nr_reqs=8)   | 116079     | 2.44
>> 6 | io_uring (@flush=true,  nr_reqs=8)   | 109722     | 2.31
>>
>> Based on selftests/.../msg_zerocopy but more limited. You can use
>> msg_zerocopy -r as usual for receive side.
>>
> ...
> 
> Can you state the exact command lines you are running for all of the
> commands? I tried this set (and commands referenced below) and my

Sure. First, for dummy I set mtu by hand, not sure can do it from
the userspace, can I? Without it __ip_append_data() falls into
non-zerocopy path.

diff --git a/drivers/net/dummy.c b/drivers/net/dummy.c
index f82ad7419508..5c5aeacdabd5 100644
--- a/drivers/net/dummy.c
+++ b/drivers/net/dummy.c
@@ -132,7 +132,8 @@ static void dummy_setup(struct net_device *dev)
  	eth_hw_addr_random(dev);
  
  	dev->min_mtu = 0;
-	dev->max_mtu = 0;
+	dev->mtu = 0xffff;
+	dev->max_mtu = 0xffff;
  }

# dummy configuration

modprobe dummy numdummies=1
ip link set dummy0 up
# force requests to <dummy_ip_addr> go through the dummy device
ip route add <dummy_ip_addr> dev dummy0


With dummy I was just sinking the traffic to the dummy device,
was good enough for me. Omitting "taskset" and "nice":

send-zc -4 -D <dummy_ip_addr> -t 10 udp

Similarly with msg_zerocopy:

<kernel>/tools/testing/selftests/net/msg_zerocopy -4 -p 6666 -D <dummy_ip_addr> -t 10 -z udp


For loopback testing, as zerocopy is not allowed for it as Willem explained in
the original MSG_ZEROCOPY cover-letter, I used a hack to bypass it:

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index ebb12a7d386d..42df33b175ce 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2854,9 +2854,7 @@ static inline int skb_orphan_frags(struct sk_buff *skb, gfp_t gfp_mask)
  /* Frags must be orphaned, even if refcounted, if skb might loop to rx path */
  static inline int skb_orphan_frags_rx(struct sk_buff *skb, gfp_t gfp_mask)
  {
-	if (likely(!skb_zcopy(skb)))
-		return 0;
-	return skb_copy_ubufs(skb, gfp_mask);
+	return skb_orphan_frags(skb, gfp_mask);
  }
  
  /**

Then running those two lines below in parallel and looking for the numbers
send shows. It was in favor of io_uring for me, but don't remember
exactly. perf shows that "send-zc" spends lot of time receiving, so
wasn't testing performance of it after some point.

msg_zerocopy -r -v -4 -t 20 udp
send-zc -4 -D 127.0.0.1 -t 10 udp


> mileage varies quite a bit.

Interesting, any brief notes on the setup and the results? Dummy
or something real? io_uring doesn't show if it was really zerocopied
or not, but I assume you checked it (e.g. with perf/bpftrace).

I expected that @flush=true might be worse with real devices,
there is one spot to be patched, but apart from that and
cycles spend in a real LLD offseting the overhead, didn't
anticipate any problems. I'll see once I try a real device.


> Also, have you run this proposed change (and with TCP) across nodes
> (ie., not just local process to local process via dummy interface)?

Not yet, I tried dummy, and localhost UDP as per above and similarly
TCP. Just need to grab a server with a proper NIC, will try it out
soon.

>> Benchmark:
>> https://github.com/isilence/liburing.git zc_v1
>>
>> or this file in particular:
>> https://github.com/isilence/liburing/blob/zc_v1/test/send-zc.c
>>
>> To run the benchmark:
>> ```
>> cd <liburing_dir> && make && cd test
>> # ./send-zc -4 [-p <port>] [-s <payload_size>] -D <destination> udp
>> ./send-zc -4 -D 127.0.0.1 udp
>> ```
>>
>> msg_zerocopy can be used for the server side, e.g.
>> ```
>> cd <linux-kernel>/tools/testing/selftests/net && make
>> ./msg_zerocopy -4 -r [-p <port>] [-t <sec>] udp
>> ```
David Ahern Dec. 1, 2021, 5:49 p.m. UTC | #4
On 12/1/21 7:31 AM, Pavel Begunkov wrote:
> 
> Also, as was asked, attaching a standalone .c version of the
> benchmark. Requires any relatively up-to-date liburing installed.
> 
attached command differs from the version mentioned in the cover letter:

https://github.com/isilence/liburing.git zc_v1

copying this version into that branch, removing the duplicate
definitions and it works.
David Ahern Dec. 1, 2021, 5:57 p.m. UTC | #5
On 12/1/21 8:32 AM, Pavel Begunkov wrote:
> 
> Sure. First, for dummy I set mtu by hand, not sure can do it from
> the userspace, can I? Without it __ip_append_data() falls into
> non-zerocopy path.
> 
> diff --git a/drivers/net/dummy.c b/drivers/net/dummy.c
> index f82ad7419508..5c5aeacdabd5 100644
> --- a/drivers/net/dummy.c
> +++ b/drivers/net/dummy.c
> @@ -132,7 +132,8 @@ static void dummy_setup(struct net_device *dev)
>      eth_hw_addr_random(dev);
>  
>      dev->min_mtu = 0;
> -    dev->max_mtu = 0;
> +    dev->mtu = 0xffff;
> +    dev->max_mtu = 0xffff;
>  }
> 
> # dummy configuration
> 
> modprobe dummy numdummies=1
> ip link set dummy0 up


No change is needed to the dummy driver:
  ip li add dummy0 type dummy
  ip li set dummy0 up mtu 65536


> # force requests to <dummy_ip_addr> go through the dummy device
> ip route add <dummy_ip_addr> dev dummy0

that command is not necessary.

> 
> 
> With dummy I was just sinking the traffic to the dummy device,
> was good enough for me. Omitting "taskset" and "nice":
> 
> send-zc -4 -D <dummy_ip_addr> -t 10 udp
> 
> Similarly with msg_zerocopy:
> 
> <kernel>/tools/testing/selftests/net/msg_zerocopy -4 -p 6666 -D
> <dummy_ip_addr> -t 10 -z udp

I get -ENOBUFS with '-z' and any local address.

> 
> 
> For loopback testing, as zerocopy is not allowed for it as Willem
> explained in
> the original MSG_ZEROCOPY cover-letter, I used a hack to bypass it:
> 
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index ebb12a7d386d..42df33b175ce 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -2854,9 +2854,7 @@ static inline int skb_orphan_frags(struct sk_buff
> *skb, gfp_t gfp_mask)
>  /* Frags must be orphaned, even if refcounted, if skb might loop to rx
> path */
>  static inline int skb_orphan_frags_rx(struct sk_buff *skb, gfp_t gfp_mask)
>  {
> -    if (likely(!skb_zcopy(skb)))
> -        return 0;
> -    return skb_copy_ubufs(skb, gfp_mask);
> +    return skb_orphan_frags(skb, gfp_mask);
>  }
>  

that is the key change that is missing in your repo. All local traffic
(traffic to the address on a dummy device falls into this comment) goes
through loopback. That's just the way Linux works. If you look at the
dummy driver, it's xmit function just drops packets if any actually make
it there.


>> mileage varies quite a bit.
> 
> Interesting, any brief notes on the setup and the results? Dummy

VM on Chromebook. I just cloned your repos, built, install and test. As
mentioned above, the skb_orphan_frags_rx change is missing from your
repo and that is the key to your reported performance gains.
Willem de Bruijn Dec. 1, 2021, 6:10 p.m. UTC | #6
> # performance:
>
> The worst case for io_uring is (4), still 1.88 times faster than
> msg_zerocopy (2), and there are a couple of "easy" optimisations left
> out from the patchset. For 4096 bytes payload zc is only slightly
> outperforms non-zc version, the larger payload the wider gap.
> I'll get more numbers next time.

> Comparing (3) and (4), and (5) vs (6), @flush doesn't affect it too
> much. Notification posting is not a big problem for now, but need
> to compare the performance for when io_uring_tx_zerocopy_callback()
> is called from IRQ context, and possible rework it to use task_work.
>
> It supports both, regular buffers and fixed ones, but there is a bunch of
> optimisations exclusively for io_uring's fixed buffers. For comparison,
> normal vs fixed buffers (@nr_reqs=8, @flush=0): 75677 vs 116079 MB/s
>
> 1) we pass a bvec, so no page table walks.
> 2) zerocopy_sg_from_iter() is just slow, adding a bvec optimised version
>    still doing page get/put (see 4/12) slashed 4-5%.
> 3) avoiding get_page/put_page in 5/12
> 4) completion events are posted into io_uring's CQ, so no
>    extra recvmsg for getting events
> 5) no poll(2) in the code because of io_uring
> 6) lot of time is spent in sock_omalloc()/free allocating ubuf_info.
>    io_uring caches the structures reducing it to nearly zero-overhead.

Nice set of complementary optimizations.

We have looked at adding some of those as independent additions to
msg_zerocopy before, such as long-term pinned regions. One issue with
that is that the pages must remain until the request completes,
regardless of whether the calling process is alive. So it cannot rely
on a pinned range held by a process only.

If feasible, it would be preferable if the optimizations can be added
to msg_zerocopy directly, rather than adding a dependency on io_uring
to make use of them. But not sure how feasible that is. For some, like
4 and 5, the answer is clearly it isn't.  6, it probably is?

> # discussion / questions
>
> I haven't got a grasp on many aspects of the net stack yet, so would
> appreciate feedback in general and there are a couple of questions
> thoughts.
>
> 1) What are initialisation rules for adding a new field into
> struct mshdr? E.g. many users (mainly LLD) hand code initialisation not
> filling all the fields.
>
> 2) I don't like too much ubuf_info propagation from udp_sendmsg() into
> __ip_append_data() (see 3/12). Ideas how to do it better?

Agreed that both of these are less than ideal.

I can't comment too much on the io_uring aspect of the patch series.
But msg_zerocopy is probably used in a small fraction of traffic (even
if a high fraction for users who care about its benefits). We have to
try to minimize the cost incurred on the general hot path.

I was going to suggest using the standard msg_zerocopy ubuf_info
alloc/free mechanism. But you explicitly mention seeing omalloc/ofree
in the cycle profile.

It might still be possible to somehow signal to msg_zerocopy_alloc
that this is being called from within an io_uring request, and
therefore should use a pre-existing uarg with different
uarg->callback. If nothing else, some info can be passed as a cmsg.
But perhaps there is a more direct pointer path to follow from struct
sk, say? Here my limited knowledge of io_uring forces me to hand wave.

Probably also want to see how all this would integrate with TCP. In
some ways, that might be easier, as it does not have the indirection
through ip_make_skb, etc.
Pavel Begunkov Dec. 1, 2021, 7:11 p.m. UTC | #7
On 12/1/21 17:57, David Ahern wrote:
> On 12/1/21 8:32 AM, Pavel Begunkov wrote:
>>
>> Sure. First, for dummy I set mtu by hand, not sure can do it from
>> the userspace, can I? Without it __ip_append_data() falls into
>> non-zerocopy path.
>>
[...]
>>
>> modprobe dummy numdummies=1
>> ip link set dummy0 up
> 
> 
> No change is needed to the dummy driver:
>    ip li add dummy0 type dummy
>    ip li set dummy0 up mtu 65536

awesome, thanks!

>> # force requests to <dummy_ip_addr> go through the dummy device
>> ip route add <dummy_ip_addr> dev dummy0
> 
> that command is not necessary.
> 
>>
>>
>> With dummy I was just sinking the traffic to the dummy device,
>> was good enough for me. Omitting "taskset" and "nice":
>>
>> send-zc -4 -D <dummy_ip_addr> -t 10 udp
>>
>> Similarly with msg_zerocopy:
>>
>> <kernel>/tools/testing/selftests/net/msg_zerocopy -4 -p 6666 -D
>> <dummy_ip_addr> -t 10 -z udp
> 
> I get -ENOBUFS with '-z' and any local address.

Ah, right. Citing from Willem's MSG_ZEROCOPY letter:

"
Notification skbuffs are allocated from optmem. For sockets that
cannot effectively coalesce notifications, the optmem max may need
to be increased to avoid hitting -ENOBUFS:

   sysctl -w net.core.optmem_max=1048576
"


>> For loopback testing, as zerocopy is not allowed for it as Willem
>> explained in
>> the original MSG_ZEROCOPY cover-letter, I used a hack to bypass it:
>>
>> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
>> index ebb12a7d386d..42df33b175ce 100644
>> --- a/include/linux/skbuff.h
>> +++ b/include/linux/skbuff.h
>> @@ -2854,9 +2854,7 @@ static inline int skb_orphan_frags(struct sk_buff
>> *skb, gfp_t gfp_mask)
>>   /* Frags must be orphaned, even if refcounted, if skb might loop to rx
>> path */
>>   static inline int skb_orphan_frags_rx(struct sk_buff *skb, gfp_t gfp_mask)
>>   {
>> -    if (likely(!skb_zcopy(skb)))
>> -        return 0;
>> -    return skb_copy_ubufs(skb, gfp_mask);
>> +    return skb_orphan_frags(skb, gfp_mask);
>>   }
>>   
> 
> that is the key change that is missing in your repo. All local traffic
> (traffic to the address on a dummy device falls into this comment) goes
> through loopback. That's just the way Linux works. If you look at the
> dummy driver, it's xmit function just drops packets if any actually make
> it there.

Not at all, the measurements were done without this patch. In case it
may shed some light, attaching a fresh flamegraph, same 115761.6 MB/s

btw, why a dummy device would ever go through loopback? It doesn't
seem to make sense, though may be missing something.


>>> mileage varies quite a bit.
>>
>> Interesting, any brief notes on the setup and the results? Dummy
> 
> VM on Chromebook. I just cloned your repos, built, install and test. As
> mentioned above, the skb_orphan_frags_rx change is missing from your
> repo and that is the key to your reported performance gains.
>
David Ahern Dec. 1, 2021, 7:20 p.m. UTC | #8
On 12/1/21 12:11 PM, Pavel Begunkov wrote:
> btw, why a dummy device would ever go through loopback? It doesn't
> seem to make sense, though may be missing something.

You are sending to a local ip address, so the fib_lookup returns
RTN_LOCAL. The code makes dev_out the loopback:

https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/tree/net/ipv4/route.c#n2773

(you are not using vrf so ignore the l3mdev reference). loopback device
has the logic to put the skb back in the stack for Rx processing:

https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/tree/drivers/net/loopback.c#n68
Pavel Begunkov Dec. 1, 2021, 7:59 p.m. UTC | #9
On 12/1/21 18:10, Willem de Bruijn wrote:
>> # performance:
>>
>> The worst case for io_uring is (4), still 1.88 times faster than
>> msg_zerocopy (2), and there are a couple of "easy" optimisations left
>> out from the patchset. For 4096 bytes payload zc is only slightly
>> outperforms non-zc version, the larger payload the wider gap.
>> I'll get more numbers next time.
> 
>> Comparing (3) and (4), and (5) vs (6), @flush doesn't affect it too
>> much. Notification posting is not a big problem for now, but need
>> to compare the performance for when io_uring_tx_zerocopy_callback()
>> is called from IRQ context, and possible rework it to use task_work.
>>
>> It supports both, regular buffers and fixed ones, but there is a bunch of
>> optimisations exclusively for io_uring's fixed buffers. For comparison,
>> normal vs fixed buffers (@nr_reqs=8, @flush=0): 75677 vs 116079 MB/s
>>
>> 1) we pass a bvec, so no page table walks.
>> 2) zerocopy_sg_from_iter() is just slow, adding a bvec optimised version
>>     still doing page get/put (see 4/12) slashed 4-5%.
>> 3) avoiding get_page/put_page in 5/12
>> 4) completion events are posted into io_uring's CQ, so no
>>     extra recvmsg for getting events
>> 5) no poll(2) in the code because of io_uring
>> 6) lot of time is spent in sock_omalloc()/free allocating ubuf_info.
>>     io_uring caches the structures reducing it to nearly zero-overhead.
> 
> Nice set of complementary optimizations.
> 
> We have looked at adding some of those as independent additions to
> msg_zerocopy before, such as long-term pinned regions. One issue with
> that is that the pages must remain until the request completes,
> regardless of whether the calling process is alive. So it cannot rely
> on a pinned range held by a process only.
> 
> If feasible, it would be preferable if the optimizations can be added
> to msg_zerocopy directly, rather than adding a dependency on io_uring
> to make use of them. But not sure how feasible that is. For some, like
> 4 and 5, the answer is clearly it isn't.  6, it probably is?

And for 3), io_uring has a complex infra for keeping pages alive,
the additional overhead is one almost percpu_ref_put() per
request/notification, or even better in common cases. Not sure it's
feasible/possible with current msg_zerocopy. Also, io_uring's
ubufs are kept as a part of a larger structure, which may complicate
things.


>> # discussion / questions
>>
>> I haven't got a grasp on many aspects of the net stack yet, so would
>> appreciate feedback in general and there are a couple of questions
>> thoughts.
>>
>> 1) What are initialisation rules for adding a new field into
>> struct mshdr? E.g. many users (mainly LLD) hand code initialisation not
>> filling all the fields.
>>
>> 2) I don't like too much ubuf_info propagation from udp_sendmsg() into
>> __ip_append_data() (see 3/12). Ideas how to do it better?
> 
> Agreed that both of these are less than ideal.
> 
> I can't comment too much on the io_uring aspect of the patch series.
> But msg_zerocopy is probably used in a small fraction of traffic (even
> if a high fraction for users who care about its benefits). We have to
> try to minimize the cost incurred on the general hot path.

One thing, I can hide the initial ubuf check in the beginning of
__ip_append_data() under a common

if (sock_flag(sk, SOCK_ZEROCOPY)) {}

But as SOCK_ZEROCOPY is more of a design problem workaround,
tbh not sure I like from the API perspective. Thoughts? I hope
I can also shuffle some of the stuff in 5/12 out of the
hot path, need to dig a bit deeper.

> I was going to suggest using the standard msg_zerocopy ubuf_info
> alloc/free mechanism. But you explicitly mention seeing omalloc/ofree
> in the cycle profile.
> 
> It might still be possible to somehow signal to msg_zerocopy_alloc
> that this is being called from within an io_uring request, and
> therefore should use a pre-existing uarg with different
> uarg->callback. If nothing else, some info can be passed as a cmsg.
> But perhaps there is a more direct pointer path to follow from struct
> sk, say? Here my limited knowledge of io_uring forces me to hand wave.

One thing I consider important though is to be able to specify a
ubuf per request, but not somehow registering it in a socket. It's
more flexible from the userspace API perspective. It would also need
constant register/unregister, and there are concerns with
referencing/cancellations, that's where it came from in the first
place.

IOW, I'd really prefer to pass it down on a per request basis.

> Probably also want to see how all this would integrate with TCP. In
> some ways, that might be easier, as it does not have the indirection
> through ip_make_skb, etc.

Worked well in general, but patches I used should be a broken for
some input after adding 5/12, so need some work. will send next time.
Pavel Begunkov Dec. 1, 2021, 7:59 p.m. UTC | #10
On 12/1/21 17:49, David Ahern wrote:
> On 12/1/21 7:31 AM, Pavel Begunkov wrote:
>>
>> Also, as was asked, attaching a standalone .c version of the
>> benchmark. Requires any relatively up-to-date liburing installed.
>>
> attached command differs from the version mentioned in the cover letter:

I guess you mean the new options mentioned in the message that
were added to that standalone script. They look useful, will
update the repo with a similar change later.


> https://github.com/isilence/liburing.git zc_v1
> 
> copying this version into that branch, removing the duplicate
> definitions and it works.
>
Pavel Begunkov Dec. 1, 2021, 8:15 p.m. UTC | #11
On 12/1/21 19:20, David Ahern wrote:
> On 12/1/21 12:11 PM, Pavel Begunkov wrote:
>> btw, why a dummy device would ever go through loopback? It doesn't
>> seem to make sense, though may be missing something.
> 
> You are sending to a local ip address, so the fib_lookup returns
> RTN_LOCAL. The code makes dev_out the loopback:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/tree/net/ipv4/route.c#n2773

I see, thanks. I still don't use the skb_orphan_frags_rx() hack
and it doesn't go through the loopback (for my dummy tests), just
dummy_xmit() and no mention of loopback in perf data, see the
flamegraph. Don't know what is the catch.

I'm illiterate of the routing paths. Can it be related to
the "ip route add"? How do you get an ipv4 address for the device?

> 
> (you are not using vrf so ignore the l3mdev reference). loopback device
> has the logic to put the skb back in the stack for Rx processing:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/tree/drivers/net/loopback.c#n68
>
Pavel Begunkov Dec. 1, 2021, 8:29 p.m. UTC | #12
On 12/1/21 19:59, Pavel Begunkov wrote:
> On 12/1/21 18:10, Willem de Bruijn wrote:
>>> # performance:
>>>
>>> The worst case for io_uring is (4), still 1.88 times faster than
>>> msg_zerocopy (2), and there are a couple of "easy" optimisations left
>>> out from the patchset. For 4096 bytes payload zc is only slightly
>>> outperforms non-zc version, the larger payload the wider gap.
>>> I'll get more numbers next time.
>>
>>> Comparing (3) and (4), and (5) vs (6), @flush doesn't affect it too
>>> much. Notification posting is not a big problem for now, but need
>>> to compare the performance for when io_uring_tx_zerocopy_callback()
>>> is called from IRQ context, and possible rework it to use task_work.
>>>
>>> It supports both, regular buffers and fixed ones, but there is a bunch of
>>> optimisations exclusively for io_uring's fixed buffers. For comparison,
>>> normal vs fixed buffers (@nr_reqs=8, @flush=0): 75677 vs 116079 MB/s
>>>
>>> 1) we pass a bvec, so no page table walks.
>>> 2) zerocopy_sg_from_iter() is just slow, adding a bvec optimised version
>>>     still doing page get/put (see 4/12) slashed 4-5%.
>>> 3) avoiding get_page/put_page in 5/12
>>> 4) completion events are posted into io_uring's CQ, so no
>>>     extra recvmsg for getting events
>>> 5) no poll(2) in the code because of io_uring
>>> 6) lot of time is spent in sock_omalloc()/free allocating ubuf_info.
>>>     io_uring caches the structures reducing it to nearly zero-overhead.
>>
>> Nice set of complementary optimizations.
>>
>> We have looked at adding some of those as independent additions to
>> msg_zerocopy before, such as long-term pinned regions. One issue with
>> that is that the pages must remain until the request completes,
>> regardless of whether the calling process is alive. So it cannot rely
>> on a pinned range held by a process only.
>>
>> If feasible, it would be preferable if the optimizations can be added
>> to msg_zerocopy directly, rather than adding a dependency on io_uring
>> to make use of them. But not sure how feasible that is. For some, like
>> 4 and 5, the answer is clearly it isn't.  6, it probably is?

Forgot about 6), io_uring uses the fact that submissions are
done under an per ring mutex, and completions are under a per
ring spinlock, so there are two lists for them and no extra
locking. Lists are spliced in a batched manner, so it's
1 spinlock per N (e.g. 32) cached ubuf_info's allocations.

Any similar guarantees for sockets?

[...]
Pavel Begunkov Dec. 1, 2021, 8:42 p.m. UTC | #13
On 12/1/21 17:57, David Ahern wrote:
> On 12/1/21 8:32 AM, Pavel Begunkov wrote:
[...]
>>> mileage varies quite a bit.
>>
>> Interesting, any brief notes on the setup and the results? Dummy
> 
> VM on Chromebook. I just cloned your repos, built, install and test. As
> mentioned above, the skb_orphan_frags_rx change is missing from your
> repo and that is the key to your reported performance gains.

Just to clear misunderstandings if any, all the numbers in the
cover-letter were measured on the same kernel and during the same
boot, and it doesn't include the skb_orphan_frags_rx() change.
All double checked by looking at the traces.

When it's routed through the loopback paths (e.g. -D 127.0.0.1),
it's indeed slow for both msg_zerocopy and send-zc, and both
"benefit" in raw throughput from the hack.
Martin KaFai Lau Dec. 1, 2021, 9:51 p.m. UTC | #14
On Wed, Dec 01, 2021 at 08:15:28PM +0000, Pavel Begunkov wrote:
> On 12/1/21 19:20, David Ahern wrote:
> > On 12/1/21 12:11 PM, Pavel Begunkov wrote:
> > > btw, why a dummy device would ever go through loopback? It doesn't
> > > seem to make sense, though may be missing something.
> > 
> > You are sending to a local ip address, so the fib_lookup returns
> > RTN_LOCAL. The code makes dev_out the loopback:
> > 
> > https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/tree/net/ipv4/route.c#n2773
> 
> I see, thanks. I still don't use the skb_orphan_frags_rx() hack
> and it doesn't go through the loopback (for my dummy tests), just
> dummy_xmit() and no mention of loopback in perf data, see the
> flamegraph. Don't know what is the catch.
> 
> I'm illiterate of the routing paths. Can it be related to
> the "ip route add"? How do you get an ipv4 address for the device?
I also bumped into the udp-connect() => ECONNREFUSED (111) error from send-zc.
because I assumed no server is needed by using dummy.  Then realized
the cover letter mentioned msg_zerocopy is used as the server.
Mentioning just in case someone hits it also.

To tx out dummy, I did:
#> ip a add 10.0.0.1/24 dev dummy0

#> ip -4 r
10.0.0.0/24 dev dummy0 proto kernel scope link src 10.0.0.1

#> ./send-zc -4 -D 10.0.0.(2) -t 10 udp
ip -s link show dev dummy0
2: dummy0: <BROADCAST,NOARP,UP,LOWER_UP> mtu 65535 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
   link/ether 82:0f:e0:dc:f7:e6 brd ff:ff:ff:ff:ff:ff
   RX:    bytes packets errors dropped  missed   mcast
              0       0      0       0       0       0
   TX:    bytes packets errors dropped carrier collsns
   140800890299 2150397      0       0       0       0
David Ahern Dec. 1, 2021, 10:35 p.m. UTC | #15
On 12/1/21 2:51 PM, Martin KaFai Lau wrote:
> 
> To tx out dummy, I did:
> #> ip a add 10.0.0.1/24 dev dummy0
              ^^^^^^^^
> 
> #> ip -4 r
> 10.0.0.0/24 dev dummy0 proto kernel scope link src 10.0.0.1
> 
> #> ./send-zc -4 -D 10.0.0.(2) -t 10 udp
                     ^^^^^^^^^^

Pavel's commands have: 'send-zc -4 -D <dummy_ip_addr> -t 10 udp'

I read dummy_ip_addr as the address assigned to dummy0; that's an
important detail. You are sending to an address on that network, not the
address assigned to the device, in which case packets are created and
then dropped by the dummy driver - nothing actually makes it to the server.


> ip -s link show dev dummy0
> 2: dummy0: <BROADCAST,NOARP,UP,LOWER_UP> mtu 65535 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
>    link/ether 82:0f:e0:dc:f7:e6 brd ff:ff:ff:ff:ff:ff
>    RX:    bytes packets errors dropped  missed   mcast
>               0       0      0       0       0       0
>    TX:    bytes packets errors dropped carrier collsns
>    140800890299 2150397      0       0       0       0
>
Martin KaFai Lau Dec. 1, 2021, 11:07 p.m. UTC | #16
On Wed, Dec 01, 2021 at 03:35:49PM -0700, David Ahern wrote:
> On 12/1/21 2:51 PM, Martin KaFai Lau wrote:
> > 
> > To tx out dummy, I did:
> > #> ip a add 10.0.0.1/24 dev dummy0
>               ^^^^^^^^
> > 
> > #> ip -4 r
> > 10.0.0.0/24 dev dummy0 proto kernel scope link src 10.0.0.1
> > 
> > #> ./send-zc -4 -D 10.0.0.(2) -t 10 udp
>                      ^^^^^^^^^^
> 
> Pavel's commands have: 'send-zc -4 -D <dummy_ip_addr> -t 10 udp'
> 
> I read dummy_ip_addr as the address assigned to dummy0; that's an
> important detail. You are sending to an address on that network, not the
> address assigned to the device, in which case packets are created and
> then dropped by the dummy driver - nothing actually makes it to the server.
Right, I assumed "dropped by dummy driver" is the usual intention
for using dummy, so just in case if it was the intention for
testing tx only.  You are right and it seems the intention
of this command is to have server receiving the packets.

> 
> 
> > ip -s link show dev dummy0
> > 2: dummy0: <BROADCAST,NOARP,UP,LOWER_UP> mtu 65535 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
> >    link/ether 82:0f:e0:dc:f7:e6 brd ff:ff:ff:ff:ff:ff
> >    RX:    bytes packets errors dropped  missed   mcast
> >               0       0      0       0       0       0
> >    TX:    bytes packets errors dropped carrier collsns
> >    140800890299 2150397      0       0       0       0
> > 
>
Pavel Begunkov Dec. 1, 2021, 11:18 p.m. UTC | #17
On 12/1/21 23:07, Martin KaFai Lau wrote:
> On Wed, Dec 01, 2021 at 03:35:49PM -0700, David Ahern wrote:
>> On 12/1/21 2:51 PM, Martin KaFai Lau wrote:
>>>
>>> To tx out dummy, I did:
>>> #> ip a add 10.0.0.1/24 dev dummy0
>>                ^^^^^^^^
>>>
>>> #> ip -4 r
>>> 10.0.0.0/24 dev dummy0 proto kernel scope link src 10.0.0.1
>>>
>>> #> ./send-zc -4 -D 10.0.0.(2) -t 10 udp
>>                       ^^^^^^^^^^
>>
>> Pavel's commands have: 'send-zc -4 -D <dummy_ip_addr> -t 10 udp'
>>
>> I read dummy_ip_addr as the address assigned to dummy0; that's an
>> important detail. You are sending to an address on that network, not the
>> address assigned to the device, in which case packets are created and
>> then dropped by the dummy driver - nothing actually makes it to the server.
> Right, I assumed "dropped by dummy driver" is the usual intention
> for using dummy, so just in case if it was the intention for
> testing tx only.  You are right and it seems the intention
> of this command is to have server receiving the packets.

I see, it seems we found the misunderstanding:

For dummy device, I indeed was testing only tx part without
a server, in my understanding it better approximates an actual
fast NIC. I don't think dummy is even capable to pass the data,
so was thinking that it's given that there is no receive side,
and so no "msg_zerocopy -r".

For localhost testing (with the hack), there was a server
verifying data.


>>
>>> ip -s link show dev dummy0
>>> 2: dummy0: <BROADCAST,NOARP,UP,LOWER_UP> mtu 65535 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
>>>     link/ether 82:0f:e0:dc:f7:e6 brd ff:ff:ff:ff:ff:ff
>>>     RX:    bytes packets errors dropped  missed   mcast
>>>                0       0      0       0       0       0
>>>     TX:    bytes packets errors dropped carrier collsns
>>>     140800890299 2150397      0       0       0       0
>>>
>>
Willem de Bruijn Dec. 2, 2021, 12:32 a.m. UTC | #18
> >> # discussion / questions
> >>
> >> I haven't got a grasp on many aspects of the net stack yet, so would
> >> appreciate feedback in general and there are a couple of questions
> >> thoughts.
> >>
> >> 1) What are initialisation rules for adding a new field into
> >> struct mshdr? E.g. many users (mainly LLD) hand code initialisation not
> >> filling all the fields.
> >>
> >> 2) I don't like too much ubuf_info propagation from udp_sendmsg() into
> >> __ip_append_data() (see 3/12). Ideas how to do it better?
> >
> > Agreed that both of these are less than ideal.
> >
> > I can't comment too much on the io_uring aspect of the patch series.
> > But msg_zerocopy is probably used in a small fraction of traffic (even
> > if a high fraction for users who care about its benefits). We have to
> > try to minimize the cost incurred on the general hot path.
>
> One thing, I can hide the initial ubuf check in the beginning of
> __ip_append_data() under a common
>
> if (sock_flag(sk, SOCK_ZEROCOPY)) {}
>
> But as SOCK_ZEROCOPY is more of a design problem workaround,
> tbh not sure I like from the API perspective. Thoughts?

Agreed. io_uring does not have the legacy concerns that msg_zerocopy
had to resolve.

It is always possible to hide runtime overhead behind a static_branch,
if nothing else.

Or perhaps do pass the flag and use that:

  - if (flags & MSG_ZEROCOPY && length && sock_flag(sk, SOCK_ZEROCOPY)) {
  + if (flags & MSG_ZEROCOPY && length) {
  +         if (uarg) {

  etc.


> I hope
> I can also shuffle some of the stuff in 5/12 out of the
> hot path, need to dig a bit deeper.
>
> > I was going to suggest using the standard msg_zerocopy ubuf_info
> > alloc/free mechanism. But you explicitly mention seeing omalloc/ofree
> > in the cycle profile.
> >
> > It might still be possible to somehow signal to msg_zerocopy_alloc
> > that this is being called from within an io_uring request, and
> > therefore should use a pre-existing uarg with different
> > uarg->callback. If nothing else, some info can be passed as a cmsg.
> > But perhaps there is a more direct pointer path to follow from struct
> > sk, say? Here my limited knowledge of io_uring forces me to hand wave.
>
> One thing I consider important though is to be able to specify a
> ubuf per request, but not somehow registering it in a socket. It's
> more flexible from the userspace API perspective. It would also need
> constant register/unregister, and there are concerns with
> referencing/cancellations, that's where it came from in the first
> place.

What if the ubuf pool can be found from the sk, and the index in that
pool is passed as a cmsg?
Willem de Bruijn Dec. 2, 2021, 12:36 a.m. UTC | #19
> >>> 1) we pass a bvec, so no page table walks.
> >>> 2) zerocopy_sg_from_iter() is just slow, adding a bvec optimised version
> >>>     still doing page get/put (see 4/12) slashed 4-5%.
> >>> 3) avoiding get_page/put_page in 5/12
> >>> 4) completion events are posted into io_uring's CQ, so no
> >>>     extra recvmsg for getting events
> >>> 5) no poll(2) in the code because of io_uring
> >>> 6) lot of time is spent in sock_omalloc()/free allocating ubuf_info.
> >>>     io_uring caches the structures reducing it to nearly zero-overhead.
> >>
> >> Nice set of complementary optimizations.
> >>
> >> We have looked at adding some of those as independent additions to
> >> msg_zerocopy before, such as long-term pinned regions. One issue with
> >> that is that the pages must remain until the request completes,
> >> regardless of whether the calling process is alive. So it cannot rely
> >> on a pinned range held by a process only.
> >>
> >> If feasible, it would be preferable if the optimizations can be added
> >> to msg_zerocopy directly, rather than adding a dependency on io_uring
> >> to make use of them. But not sure how feasible that is. For some, like
> >> 4 and 5, the answer is clearly it isn't.  6, it probably is?
>
> Forgot about 6), io_uring uses the fact that submissions are
> done under an per ring mutex, and completions are under a per
> ring spinlock, so there are two lists for them and no extra
> locking. Lists are spliced in a batched manner, so it's
> 1 spinlock per N (e.g. 32) cached ubuf_info's allocations.
>
> Any similar guarantees for sockets?

For datagrams it might matter, not sure if it would show up in a
profile. The current notification mechanism is quite a bit more
heavyweight than any form of fixed ubuf pool.

For TCP this matters less, as multiple sends are not needed and
completions are coalesced, because in order.
Pavel Begunkov Dec. 2, 2021, 3:48 p.m. UTC | #20
On 12/1/21 21:51, Martin KaFai Lau wrote:
> On Wed, Dec 01, 2021 at 08:15:28PM +0000, Pavel Begunkov wrote:
>> On 12/1/21 19:20, David Ahern wrote:
>>> On 12/1/21 12:11 PM, Pavel Begunkov wrote:
>>>> btw, why a dummy device would ever go through loopback? It doesn't
>>>> seem to make sense, though may be missing something.
>>>
>>> You are sending to a local ip address, so the fib_lookup returns
>>> RTN_LOCAL. The code makes dev_out the loopback:
>>>
>>> https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/tree/net/ipv4/route.c#n2773
>>
>> I see, thanks. I still don't use the skb_orphan_frags_rx() hack
>> and it doesn't go through the loopback (for my dummy tests), just
>> dummy_xmit() and no mention of loopback in perf data, see the
>> flamegraph. Don't know what is the catch.
>>
>> I'm illiterate of the routing paths. Can it be related to
>> the "ip route add"? How do you get an ipv4 address for the device?
> I also bumped into the udp-connect() => ECONNREFUSED (111) error from send-zc.
> because I assumed no server is needed by using dummy.  Then realized
> the cover letter mentioned msg_zerocopy is used as the server.
> Mentioning just in case someone hits it also.
> 
> To tx out dummy, I did:
> #> ip a add 10.0.0.1/24 dev dummy0

Works well for me, IOW getting the same behaviour as with my
ip route add <ip> dev dummy0

I'm curious what is the difference bw them?


> #> ip -4 r
> 10.0.0.0/24 dev dummy0 proto kernel scope link src 10.0.0.1
> 
> #> ./send-zc -4 -D 10.0.0.(2) -t 10 udp
> ip -s link show dev dummy0
> 2: dummy0: <BROADCAST,NOARP,UP,LOWER_UP> mtu 65535 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
>     link/ether 82:0f:e0:dc:f7:e6 brd ff:ff:ff:ff:ff:ff
>     RX:    bytes packets errors dropped  missed   mcast
>                0       0      0       0       0       0
>     TX:    bytes packets errors dropped carrier collsns
>     140800890299 2150397      0       0       0       0
>
Pavel Begunkov Dec. 2, 2021, 4:25 p.m. UTC | #21
On 12/2/21 00:36, Willem de Bruijn wrote:
>>>>> 1) we pass a bvec, so no page table walks.
>>>>> 2) zerocopy_sg_from_iter() is just slow, adding a bvec optimised version
>>>>>      still doing page get/put (see 4/12) slashed 4-5%.
>>>>> 3) avoiding get_page/put_page in 5/12
>>>>> 4) completion events are posted into io_uring's CQ, so no
>>>>>      extra recvmsg for getting events
>>>>> 5) no poll(2) in the code because of io_uring
>>>>> 6) lot of time is spent in sock_omalloc()/free allocating ubuf_info.
>>>>>      io_uring caches the structures reducing it to nearly zero-overhead.
>>>>
>>>> Nice set of complementary optimizations.
>>>>
>>>> We have looked at adding some of those as independent additions to
>>>> msg_zerocopy before, such as long-term pinned regions. One issue with
>>>> that is that the pages must remain until the request completes,
>>>> regardless of whether the calling process is alive. So it cannot rely
>>>> on a pinned range held by a process only.
>>>>
>>>> If feasible, it would be preferable if the optimizations can be added
>>>> to msg_zerocopy directly, rather than adding a dependency on io_uring
>>>> to make use of them. But not sure how feasible that is. For some, like
>>>> 4 and 5, the answer is clearly it isn't.  6, it probably is?
>>
>> Forgot about 6), io_uring uses the fact that submissions are
>> done under an per ring mutex, and completions are under a per
>> ring spinlock, so there are two lists for them and no extra
>> locking. Lists are spliced in a batched manner, so it's
>> 1 spinlock per N (e.g. 32) cached ubuf_info's allocations.
>>
>> Any similar guarantees for sockets?
> 
> For datagrams it might matter, not sure if it would show up in a
> profile. The current notification mechanism is quite a bit more
> heavyweight than any form of fixed ubuf pool.

Just to give an idea what I'm seeing in profiles: while testing

3 | io_uring (@flush=false, nr_reqs=1)   |  96534     | 2.03

I found that removing one extra smb_mb() per request in io_uring
gave around +0.65% of t-put (quick testing). In profiles the
function where it was dropped from 0.93% to 0.09%.

 From what I see, alloc+free takes 6-10% for 64KB UDP, it may be
great to have something for MSG_ZEROCOPY, but if that adds
additional locking/atomics, honestly I'd prefer to keep it separate
from io_uring's caching.

I also hope we can optimise generic paths at some point, and the
faster it gets the more such additional locking will hurt, pretty
much how it was with the block layer.

> For TCP this matters less, as multiple sends are not needed and
> completions are coalesced, because in order.
>
Pavel Begunkov Dec. 2, 2021, 4:45 p.m. UTC | #22
On 12/2/21 00:32, Willem de Bruijn wrote:
>>>> # discussion / questions
>>>>
>>>> I haven't got a grasp on many aspects of the net stack yet, so would
>>>> appreciate feedback in general and there are a couple of questions
>>>> thoughts.
>>>>
>>>> 1) What are initialisation rules for adding a new field into
>>>> struct mshdr? E.g. many users (mainly LLD) hand code initialisation not
>>>> filling all the fields.
>>>>
>>>> 2) I don't like too much ubuf_info propagation from udp_sendmsg() into
>>>> __ip_append_data() (see 3/12). Ideas how to do it better?
>>>
>>> Agreed that both of these are less than ideal.
>>>
>>> I can't comment too much on the io_uring aspect of the patch series.
>>> But msg_zerocopy is probably used in a small fraction of traffic (even
>>> if a high fraction for users who care about its benefits). We have to
>>> try to minimize the cost incurred on the general hot path.
>>
>> One thing, I can hide the initial ubuf check in the beginning of
>> __ip_append_data() under a common
>>
>> if (sock_flag(sk, SOCK_ZEROCOPY)) {}
>>
>> But as SOCK_ZEROCOPY is more of a design problem workaround,
>> tbh not sure I like from the API perspective. Thoughts?
> 
> Agreed. io_uring does not have the legacy concerns that msg_zerocopy
> had to resolve.
> 
> It is always possible to hide runtime overhead behind a static_branch,
> if nothing else.
> 
> Or perhaps do pass the flag and use that:
> 
>    - if (flags & MSG_ZEROCOPY && length && sock_flag(sk, SOCK_ZEROCOPY)) {
>    + if (flags & MSG_ZEROCOPY && length) {
>    +         if (uarg) {
> 
>    etc.

Good idea. Unfortunately, not going to work (SOCK_ZEROCOPY would neither)
because we pass ubuf as a parameter into the function, and e.g. we need
to NULL it if not used, but at least good for tcp_sendmsg_locked

>> I hope
>> I can also shuffle some of the stuff in 5/12 out of the
>> hot path, need to dig a bit deeper.
>>
>>> I was going to suggest using the standard msg_zerocopy ubuf_info
>>> alloc/free mechanism. But you explicitly mention seeing omalloc/ofree
>>> in the cycle profile.
>>>
>>> It might still be possible to somehow signal to msg_zerocopy_alloc
>>> that this is being called from within an io_uring request, and
>>> therefore should use a pre-existing uarg with different
>>> uarg->callback. If nothing else, some info can be passed as a cmsg.
>>> But perhaps there is a more direct pointer path to follow from struct
>>> sk, say? Here my limited knowledge of io_uring forces me to hand wave.
>>
>> One thing I consider important though is to be able to specify a
>> ubuf per request, but not somehow registering it in a socket. It's
>> more flexible from the userspace API perspective. It would also need
>> constant register/unregister, and there are concerns with
>> referencing/cancellations, that's where it came from in the first
>> place.
> 
> What if the ubuf pool can be found from the sk, and the index in that
> pool is passed as a cmsg?

It looks to me that ubufs are by nature is something that is not
tightly bound to a socket (at least for io_uring API in the patchset),
it'll be pretty ugly:

1) io_uring'd need to care to register the pool in the socket. Having
multiple rings using the same socket would be horrible. It may be that
it doesn't make much sense to send in parallel from multiple rings, but
a per thread io_uring is a popular solution, and then someone would
want to pass a socket from one thread to another and we'd need to support
it.

2) And io_uring would also need to unregister it, so the pool would
store a list of sockets where it's used, and so referencing sockets
and then we need to bind it somehow to io_uring fixed files or
register all that for tracking referencing circular dependencies.

3) IIRC, we can't add a cmsg entry from the kernel, right? May be wrong,
but if so I don't like exposing basically io_uring's referencing through
cmsg. And it sounds io_uring would need to parse cmsg then.


A lot of nuances :) I'd really prefer to pass it on per-request basis,
it's much cleaner, but still haven't got what's up with msghdr
initialisation...

Maybe, it's better to add a flags field, which would include
"msg_control_is_user : 1" and whether msghdr includes msg_iocb, msg_ubuf,
and everything else that may be optional. Does it sound sane?
Martin KaFai Lau Dec. 2, 2021, 5:40 p.m. UTC | #23
On Thu, Dec 02, 2021 at 03:48:14PM +0000, Pavel Begunkov wrote:
> On 12/1/21 21:51, Martin KaFai Lau wrote:
> > On Wed, Dec 01, 2021 at 08:15:28PM +0000, Pavel Begunkov wrote:
> > > On 12/1/21 19:20, David Ahern wrote:
> > > > On 12/1/21 12:11 PM, Pavel Begunkov wrote:
> > > > > btw, why a dummy device would ever go through loopback? It doesn't
> > > > > seem to make sense, though may be missing something.
> > > > 
> > > > You are sending to a local ip address, so the fib_lookup returns
> > > > RTN_LOCAL. The code makes dev_out the loopback:
> > > > 
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/tree/net/ipv4/route.c#n2773
> > > 
> > > I see, thanks. I still don't use the skb_orphan_frags_rx() hack
> > > and it doesn't go through the loopback (for my dummy tests), just
> > > dummy_xmit() and no mention of loopback in perf data, see the
> > > flamegraph. Don't know what is the catch.
> > > 
> > > I'm illiterate of the routing paths. Can it be related to
> > > the "ip route add"? How do you get an ipv4 address for the device?
> > I also bumped into the udp-connect() => ECONNREFUSED (111) error from send-zc.
> > because I assumed no server is needed by using dummy.  Then realized
> > the cover letter mentioned msg_zerocopy is used as the server.
> > Mentioning just in case someone hits it also.
> > 
> > To tx out dummy, I did:
> > #> ip a add 10.0.0.1/24 dev dummy0
> 
> Works well for me, IOW getting the same behaviour as with my
> ip route add <ip> dev dummy0
> 
> I'm curious what is the difference bw them?
No difference.  It should be the same.  The skb should still go out
of dummy (instead of lo) and then get drop/kfree.  I think
the confusion is probably from the name "<dummy_ip_addr>" which
points to the intention that the dummy0 has this ip addr
instead of dummy having a route to this ip address.
The need for running msg_zerocopy as the server also
adds to this confusion.  There should be no need for
server in dummy test.  No skb can reach the server anyway.

> 
> 
> > #> ip -4 r
> > 10.0.0.0/24 dev dummy0 proto kernel scope link src 10.0.0.1
> > 
> > #> ./send-zc -4 -D 10.0.0.(2) -t 10 udp
> > ip -s link show dev dummy0
> > 2: dummy0: <BROADCAST,NOARP,UP,LOWER_UP> mtu 65535 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
> >     link/ether 82:0f:e0:dc:f7:e6 brd ff:ff:ff:ff:ff:ff
> >     RX:    bytes packets errors dropped  missed   mcast
> >                0       0      0       0       0       0
> >     TX:    bytes packets errors dropped carrier collsns
> >     140800890299 2150397      0       0       0       0
> > 
> 
> -- 
> Pavel Begunkov
Willem de Bruijn Dec. 2, 2021, 9:25 p.m. UTC | #24
> > What if the ubuf pool can be found from the sk, and the index in that
> > pool is passed as a cmsg?
>
> It looks to me that ubufs are by nature is something that is not
> tightly bound to a socket (at least for io_uring API in the patchset),
> it'll be pretty ugly:
>
> 1) io_uring'd need to care to register the pool in the socket. Having
> multiple rings using the same socket would be horrible. It may be that
> it doesn't make much sense to send in parallel from multiple rings, but
> a per thread io_uring is a popular solution, and then someone would
> want to pass a socket from one thread to another and we'd need to support
> it.
>
> 2) And io_uring would also need to unregister it, so the pool would
> store a list of sockets where it's used, and so referencing sockets
> and then we need to bind it somehow to io_uring fixed files or
> register all that for tracking referencing circular dependencies.
>
> 3) IIRC, we can't add a cmsg entry from the kernel, right? May be wrong,
> but if so I don't like exposing basically io_uring's referencing through
> cmsg. And it sounds io_uring would need to parse cmsg then.
>
>
> A lot of nuances :) I'd really prefer to pass it on per-request basis,

Ok

> it's much cleaner, but still haven't got what's up with msghdr
> initialisation...

And passing the struct through multiple layers of functions.

> Maybe, it's better to add a flags field, which would include
> "msg_control_is_user : 1" and whether msghdr includes msg_iocb, msg_ubuf,
> and everything else that may be optional. Does it sound sane?

If sendmsg takes the argument, it will just have to be initialized, I think.

Other functions are not aware of its existence so it can remain
uninitialized there.
Pavel Begunkov Dec. 3, 2021, 4:19 p.m. UTC | #25
On 12/2/21 21:25, Willem de Bruijn wrote:
>>> What if the ubuf pool can be found from the sk, and the index in that
>>> pool is passed as a cmsg?
>>
>> It looks to me that ubufs are by nature is something that is not
>> tightly bound to a socket (at least for io_uring API in the patchset),
>> it'll be pretty ugly:
>>
>> 1) io_uring'd need to care to register the pool in the socket. Having
>> multiple rings using the same socket would be horrible. It may be that
>> it doesn't make much sense to send in parallel from multiple rings, but
>> a per thread io_uring is a popular solution, and then someone would
>> want to pass a socket from one thread to another and we'd need to support
>> it.
>>
>> 2) And io_uring would also need to unregister it, so the pool would
>> store a list of sockets where it's used, and so referencing sockets
>> and then we need to bind it somehow to io_uring fixed files or
>> register all that for tracking referencing circular dependencies.
>>
>> 3) IIRC, we can't add a cmsg entry from the kernel, right? May be wrong,
>> but if so I don't like exposing basically io_uring's referencing through
>> cmsg. And it sounds io_uring would need to parse cmsg then.
>>
>>
>> A lot of nuances :) I'd really prefer to pass it on per-request basis,
> 
> Ok
> 
>> it's much cleaner, but still haven't got what's up with msghdr
>> initialisation...
> 
> And passing the struct through multiple layers of functions.

If you refer to ip_make_skb(ubuf) -> __ip_append_data(ubuf), I agree
it's a bit messier, will see what can be done. If you're about
msghdr::msg_ubuf, for me it's more like passing a callback,
which sounds like a normal thing to do.


>> Maybe, it's better to add a flags field, which would include
>> "msg_control_is_user : 1" and whether msghdr includes msg_iocb, msg_ubuf,
>> and everything else that may be optional. Does it sound sane?
> 
> If sendmsg takes the argument, it will just have to be initialized, I think.
> 
> Other functions are not aware of its existence so it can remain
> uninitialized there.

Got it, need to double check, but looks something like 1/12 should
be as you outlined.

And if there will be multiple optional fields that have to be
initialised, we would be able to hide all the zeroing under a
single bitmask. E.g. instead of

msg->field1 = NULL;
...
msg->fieldN = NULL;

It may look like

msg->mask = 0; // HAS_FIELD1 | HAS_FIELDN;
Willem de Bruijn Dec. 3, 2021, 4:30 p.m. UTC | #26
On Fri, Dec 3, 2021 at 11:19 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
>
> On 12/2/21 21:25, Willem de Bruijn wrote:
> >>> What if the ubuf pool can be found from the sk, and the index in that
> >>> pool is passed as a cmsg?
> >>
> >> It looks to me that ubufs are by nature is something that is not
> >> tightly bound to a socket (at least for io_uring API in the patchset),
> >> it'll be pretty ugly:
> >>
> >> 1) io_uring'd need to care to register the pool in the socket. Having
> >> multiple rings using the same socket would be horrible. It may be that
> >> it doesn't make much sense to send in parallel from multiple rings, but
> >> a per thread io_uring is a popular solution, and then someone would
> >> want to pass a socket from one thread to another and we'd need to support
> >> it.
> >>
> >> 2) And io_uring would also need to unregister it, so the pool would
> >> store a list of sockets where it's used, and so referencing sockets
> >> and then we need to bind it somehow to io_uring fixed files or
> >> register all that for tracking referencing circular dependencies.
> >>
> >> 3) IIRC, we can't add a cmsg entry from the kernel, right? May be wrong,
> >> but if so I don't like exposing basically io_uring's referencing through
> >> cmsg. And it sounds io_uring would need to parse cmsg then.
> >>
> >>
> >> A lot of nuances :) I'd really prefer to pass it on per-request basis,
> >
> > Ok
> >
> >> it's much cleaner, but still haven't got what's up with msghdr
> >> initialisation...
> >
> > And passing the struct through multiple layers of functions.
>
> If you refer to ip_make_skb(ubuf) -> __ip_append_data(ubuf), I agree
> it's a bit messier, will see what can be done. If you're about
> msghdr::msg_ubuf, for me it's more like passing a callback,
> which sounds like a normal thing to do.

Thanks, I do mean the first.

Also, small nit now that it comes up again msghdr::msg_ubuf is not
plain C. I would avoid that pseudo C++ notation (in the subject line
of 3/12)
>
> >> Maybe, it's better to add a flags field, which would include
> >> "msg_control_is_user : 1" and whether msghdr includes msg_iocb, msg_ubuf,
> >> and everything else that may be optional. Does it sound sane?
> >
> > If sendmsg takes the argument, it will just have to be initialized, I think.
> >
> > Other functions are not aware of its existence so it can remain
> > uninitialized there.
>
> Got it, need to double check, but looks something like 1/12 should
> be as you outlined.
>
> And if there will be multiple optional fields that have to be
> initialised, we would be able to hide all the zeroing under a
> single bitmask. E.g. instead of
>
> msg->field1 = NULL;
> ...
> msg->fieldN = NULL;
>
> It may look like
>
> msg->mask = 0; // HAS_FIELD1 | HAS_FIELDN;

Makes sense to me. This patch series only adds one field, so you can
leave the optimization for a possible future separate patch series?