[RFC,v2,00/19] io_uring zerocopy tx

Message ID cover.1640029579.git.asml.silence@gmail.com (mailing list archive)

Pavel Begunkov Dec. 21, 2021, 3:35 p.m. UTC
Update on io_uring zerocopy tx, still RFC. For v1 and design notes see

https://lore.kernel.org/io-uring/cover.1638282789.git.asml.silence@gmail.com/

Absolute numbers (against dummy) are higher than in v1: about +10-12% requests/s
for the peak performance case. 5/19 brought a couple of percent, but most of the
gain came with 8/19 and 9/19 (+8-11% in numbers, 5-7% in profiles). That change
will also be needed in the future for p2p. Is there any reason not to do the same
for paged non-zc? Small (under 100-150B) packets, perhaps?

Most of the checks are removed from non-zc paths. The implementation is a bit
trickier in __ip_append_data(), but considering the already existing assumptions
around the "from" argument it should be fine.

Benchmarks for dummy netdev, UDP/IPv4, payload size=4096:
 -n<N> is how many requests we submit per syscall. From io_uring's perspective -n1
       is wasteful and far from optimal, but it is included for comparison.
 -z0   disables zerocopy, i.e. just normal io_uring send requests
 -f    flushes "buffer free" notifications for every request

test                    | K reqs/s | speedup (vs. msg_zerocopy zc)
msg_zerocopy (non-zc)   | 1120     | 1.12
msg_zerocopy (zc)       | 997      | 1
io_uring -n1 -z0        | 1469     | 1.47
io_uring -n8 -z0        | 1780     | 1.78
io_uring -n1 -f         | 1688     | 1.69
io_uring -n1            | 1774     | 1.77
io_uring -n8 -f         | 2075     | 2.08
io_uring -n8            | 2265     | 2.27
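
The speedup column reads as each row's rate relative to the msg_zerocopy (zc)
row. A minimal sketch for recomputing such ratios, assuming that row (997 K
reqs/s) is the baseline; the helper name is made up for illustration, and the
last digit can differ from the table depending on how the original numbers
were rounded.

```shell
# speedup <row K reqs/s> <baseline K reqs/s>
# Assumption: the baseline is the msg_zerocopy (zc) row, 997 K reqs/s.
speedup() {
    awk -v r="$1" -v b="$2" 'BEGIN { printf "%.2f\n", r / b }'
}

speedup 2265 997   # io_uring -n8, prints 2.27
speedup 1120 997   # msg_zerocopy (non-zc), prints 1.12
```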

Note: comparing zc vs non-zc here may not be too interesting. The relative
performance difference can be shifted in favour of zerocopy by cutting constant
per-request overhead, and there are easy ways of doing that, e.g. by compiling
out unused features. This is even more true for the table below, where additional
noise was taking a good quarter of CPU cycles.

Some data for UDP/IPv6 between a pair of NICs. 9/19 wasn't there at the time of
testing. All tests are CPU bound, so, as expected, reqs/s for zerocopy doesn't
vary much between payload sizes. The io_uring to msg_zerocopy ratio is not too
representative, for reasons similar to those described above.

payload | test                   | K reqs/s
___________________________________________ 
 8192   | io_uring -n8 (dummy)   | 599
        | io_uring -n1 -z0       | 264
        | io_uring -n8 -z0       | 302
        | msg_zerocopy           | 248
        | msg_zerocopy -z        | 183
        | io_uring -n1 -f        | 306
        | io_uring -n1           | 318
        | io_uring -n8 -f        | 373
        | io_uring -n8           | 401

 4096   | io_uring -n8 (dummy)   | 601
        | io_uring -n1 -z0       | 303
        | io_uring -n8 -z0       | 366
        | msg_zerocopy           | 278
        | msg_zerocopy -z        | 187
        | io_uring -n1 -f        | 317
        | io_uring -n1           | 325
        | io_uring -n8 -f        | 387
        | io_uring -n8           | 405

 1024   | io_uring -n8 (dummy)   | 601
        | io_uring -n1 -z0       | 329
        | io_uring -n8 -z0       | 407
        | msg_zerocopy           | 301
        | msg_zerocopy -z        | 186
        | io_uring -n1 -f        | 317
        | io_uring -n1           | 327
        | io_uring -n8 -f        | 390
        | io_uring -n8           | 403

 512    | io_uring -n8 (dummy)   | 601
        | io_uring -n1 -z0       | 340
        | io_uring -n8 -z0       | 417
        | msg_zerocopy           | 310
        | msg_zerocopy -z        | 186
        | io_uring -n1 -f        | 317
        | io_uring -n1           | 328
        | io_uring -n8 -f        | 392
        | io_uring -n8           | 406

 128    | io_uring -n8 (dummy)   | 602
        | io_uring -n1 -z0       | 341
        | io_uring -n8 -z0       | 428
        | msg_zerocopy           | 317
        | msg_zerocopy -z        | 188
        | io_uring -n1 -f        | 318
        | io_uring -n1           | 331
        | io_uring -n8 -f        | 391
        | io_uring -n8           | 408
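
Because reqs/s stays roughly flat across payload sizes, payload throughput
scales almost linearly with the payload size. A rough sketch of the conversion,
counting payload bytes only (no UDP/IP/Ethernet headers); the helper name is
made up for illustration.

```shell
# gbps <payload bytes> <K reqs/s>
# Payload-only throughput in Gbit/s: bytes * 8 bits * K reqs/s * 1000 / 10^9.
gbps() {
    awk -v p="$1" -v k="$2" 'BEGIN { printf "%.2f\n", p * 8 * k / 1000000 }'
}

gbps 8192 401   # io_uring -n8, 8192B payload, prints 26.28
gbps 128 408    # io_uring -n8, 128B payload, prints 0.42
```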

https://github.com/isilence/linux/tree/zc_v2
https://github.com/isilence/liburing/tree/zc_v2

The benchmark is <liburing>/test/send-zc:

send-zc [-f] [-n<N>] [-z0] -s<payload size> -D<dst ip> (-6|-4) [-t<sec>] udp

As a server you can use msg_zerocopy from the kernel's selftests, or a copy of
it at <liburing>/test/msg_zerocopy. No server is needed for dummy testing.

dummy setup:
sudo ip li add dummy0 type dummy && sudo ip li set dummy0 up mtu 65536
# make traffic for the specified IP to go through dummy0
sudo ip route add <ip_address> dev dummy0

v2: remove additional overhead for non-zc from skb_release_data() (Jonathan)
    avoid msg propagation, hide extra bits of non-zc overhead
    task_work based "buffer free" notifications
    improve io_uring's notification refcounting
    added 5/19 (no pfmemalloc tracking)
    added 8/19 and 9/19 preventing small copies with zc
    misc small changes

Pavel Begunkov (19):
  skbuff: add SKBFL_DONT_ORPHAN flag
  skbuff: pass a struct ubuf_info in msghdr
  net: add zerocopy_sg_from_iter for bvec
  net: optimise page get/free for bvec zc
  net: don't track pfmemalloc for zc registered mem
  ipv4/udp: add support msgdr::msg_ubuf
  ipv6/udp: add support msgdr::msg_ubuf
  ipv4: avoid partial copy for zc
  ipv6: avoid partial copy for zc
  io_uring: add send notifiers registration
  io_uring: infrastructure for send zc notifications
  io_uring: wire send zc request type
  io_uring: add an option to flush zc notifications
  io_uring: opcode independent fixed buf import
  io_uring: sendzc with fixed buffers
  io_uring: cache struct ubuf_info
  io_uring: unclog ctx refs waiting with zc notifiers
  io_uring: task_work for notification delivery
  io_uring: optimise task referencing by notifiers

 fs/io_uring.c                 | 440 +++++++++++++++++++++++++++++++++-
 include/linux/skbuff.h        |  46 ++--
 include/linux/socket.h        |   1 +
 include/uapi/linux/io_uring.h |  14 ++
 net/compat.c                  |   1 +
 net/core/datagram.c           |  58 +++++
 net/core/skbuff.c             |  16 +-
 net/ipv4/ip_output.c          |  55 +++--
 net/ipv6/ip6_output.c         |  54 ++++-
 net/socket.c                  |   3 +
 10 files changed, 633 insertions(+), 55 deletions(-)

Comments

Pavel Begunkov Dec. 21, 2021, 3:43 p.m. UTC | #1
On 12/21/21 15:35, Pavel Begunkov wrote:
> Update on io_uring zerocopy tx, still RFC. For v1 and design notes see
[...]
> 
> Benchmarks for dummy netdev, UDP/IPv4, payload size=4096:
>   -n<N> is how many requests we submit per syscall. From io_uring perspective -n1
>         is wasteful and far from optimal, but included for comparison.
>   -z0   disables zerocopy, just normal io_uring send requests
>   -f    makes to flush "buffer free" notifications for every request
[...]
> https://github.com/isilence/linux/tree/zc_v2
> https://github.com/isilence/liburing/tree/zc_v2
> 
> The Benchmark is <liburing>/test/send-zc,
> 
> send-zc [-f] [-n<N>] [-z0] -s<payload size> -D<dst ip> (-6|-4) [-t<sec>] udp

Attaching a standalone send-zc for convenience.

gcc -O2 -o send-zc ./send-zc.c -luring