[PATCHSET v4 0/9] Improve MSG_RING DEFER_TASKRUN performance

Message ID: 20240618185631.71781-1-axboe@kernel.dk

Jens Axboe June 18, 2024, 6:48 p.m. UTC
Hi,

For v1, the replies to it, and a host of perf measurements, go here:

https://lore.kernel.org/io-uring/3d553205-0fe2-482e-8d4c-a4a1ad278893@kernel.dk/

and find v2 here:

https://lore.kernel.org/io-uring/20240530152822.535791-2-axboe@kernel.dk/

and v3 here:

https://lore.kernel.org/io-uring/20240605141933.11975-1-axboe@kernel.dk/

and you can find the git tree here:

https://git.kernel.dk/cgit/linux/log/?h=io_uring-msg-ring.1

and the silly test app being used here:

https://kernel.dk/msg-lat.c

Patches are based on top of the pending 6.11 io_uring changes.

tldr is that this series greatly improves the latency, overhead, and
throughput of sending messages to other rings. It does so by using the
existing io_uring task_work for passing messages, rather than the big
hammer of TWA_SIGNAL based generic kernel task_work. Note that this
differs from v3 of this posting, which used the CQE overflow approach.
While the CQE overflow approach still performs a bit better than this
one, this one is a bit cleaner.

Performance for local (same node CPUs) message passing before this
change:

init_flags=3000, delay=10 usec
latencies for: receiver (msg=82631)
    percentiles (nsec):
     |  1.0000th=[ 3088],  5.0000th=[ 3088], 10.0000th=[ 3120],
     | 20.0000th=[ 3248], 30.0000th=[ 3280], 40.0000th=[ 3312],
     | 50.0000th=[ 3408], 60.0000th=[ 3440], 70.0000th=[ 3472],
     | 80.0000th=[ 3504], 90.0000th=[ 3600], 95.0000th=[ 3696],
     | 99.0000th=[ 6368], 99.5000th=[ 6496], 99.9000th=[ 6880],
     | 99.9500th=[ 7008], 99.9900th=[12352]
latencies for: sender (msg=82631)
    percentiles (nsec):
     |  1.0000th=[ 5280],  5.0000th=[ 5280], 10.0000th=[ 5344],
     | 20.0000th=[ 5408], 30.0000th=[ 5472], 40.0000th=[ 5472],
     | 50.0000th=[ 5600], 60.0000th=[ 5600], 70.0000th=[ 5664],
     | 80.0000th=[ 5664], 90.0000th=[ 5792], 95.0000th=[ 5920],
     | 99.0000th=[ 8512], 99.5000th=[ 8640], 99.9000th=[ 8896],
     | 99.9500th=[ 9280], 99.9900th=[19840]

and after:

init_flags=3000, delay=10 usec
Latencies for: Sender (msg=236763)
    percentiles (nsec):
     |  1.0000th=[  225],  5.0000th=[  245], 10.0000th=[  278],
     | 20.0000th=[  294], 30.0000th=[  330], 40.0000th=[  378],
     | 50.0000th=[  418], 60.0000th=[  466], 70.0000th=[  524],
     | 80.0000th=[  604], 90.0000th=[  708], 95.0000th=[  804],
     | 99.0000th=[ 1864], 99.5000th=[ 2480], 99.9000th=[ 2768],
     | 99.9500th=[ 2864], 99.9900th=[ 3056]
Latencies for: Receiver (msg=236763)
    percentiles (nsec):
     |  1.0000th=[  764],  5.0000th=[  940], 10.0000th=[ 1096],
     | 20.0000th=[ 1416], 30.0000th=[ 1736], 40.0000th=[ 2040],
     | 50.0000th=[ 2352], 60.0000th=[ 2704], 70.0000th=[ 3152],
     | 80.0000th=[ 3856], 90.0000th=[ 4960], 95.0000th=[ 6176],
     | 99.0000th=[ 8032], 99.5000th=[ 8256], 99.9000th=[ 8768],
     | 99.9500th=[10304], 99.9900th=[91648]

and for remote (different nodes) CPUs, before:

init_flags=3000, delay=10 usec
Latencies for: Receiver (msg=44002)
    percentiles (nsec):
     |  1.0000th=[ 7264],  5.0000th=[ 8384], 10.0000th=[ 8512],
     | 20.0000th=[ 8640], 30.0000th=[ 8896], 40.0000th=[ 9024],
     | 50.0000th=[ 9152], 60.0000th=[ 9280], 70.0000th=[ 9408],
     | 80.0000th=[ 9536], 90.0000th=[ 9792], 95.0000th=[ 9920],
     | 99.0000th=[10304], 99.5000th=[13376], 99.9000th=[19840],
     | 99.9500th=[20608], 99.9900th=[25728]
Latencies for: Sender (msg=44002)
    percentiles (nsec):
     |  1.0000th=[11712],  5.0000th=[12864], 10.0000th=[12864],
     | 20.0000th=[13120], 30.0000th=[13248], 40.0000th=[13376],
     | 50.0000th=[13504], 60.0000th=[13760], 70.0000th=[13888],
     | 80.0000th=[14144], 90.0000th=[14272], 95.0000th=[14400],
     | 99.0000th=[15936], 99.5000th=[21632], 99.9000th=[24704],
     | 99.9500th=[25984], 99.9900th=[37632]

and after the changes:

init_flags=3000, delay=10 usec
Latencies for: Sender (msg=192598)
    percentiles (nsec):
     |  1.0000th=[  402],  5.0000th=[  430], 10.0000th=[  446],
     | 20.0000th=[  482], 30.0000th=[  700], 40.0000th=[  804],
     | 50.0000th=[  932], 60.0000th=[ 1176], 70.0000th=[ 1304],
     | 80.0000th=[ 1480], 90.0000th=[ 1752], 95.0000th=[ 2128],
     | 99.0000th=[ 2736], 99.5000th=[ 2928], 99.9000th=[ 4256],
     | 99.9500th=[ 8768], 99.9900th=[12864]
Latencies for: Receiver (msg=192598)
    percentiles (nsec):
     |  1.0000th=[ 2024],  5.0000th=[ 2544], 10.0000th=[ 2928],
     | 20.0000th=[ 3600], 30.0000th=[ 4048], 40.0000th=[ 4448],
     | 50.0000th=[ 4896], 60.0000th=[ 5408], 70.0000th=[ 5920],
     | 80.0000th=[ 6752], 90.0000th=[ 7904], 95.0000th=[ 9408],
     | 99.0000th=[10816], 99.5000th=[11712], 99.9000th=[16320],
     | 99.9500th=[18304], 99.9900th=[22656]

 include/linux/io_uring_types.h |   3 +
 io_uring/io_uring.c            |  53 ++++++++++++---
 io_uring/io_uring.h            |   3 +
 io_uring/msg_ring.c            | 119 ++++++++++++++++++++-------------
 io_uring/msg_ring.h            |   1 +
 5 files changed, 124 insertions(+), 55 deletions(-)

Since v3:
- Switch back to task_work approach, rather than utilize overflows
  for this
- Retain old task_work approach for fd passing
- Various tweaks and cleanups