Message ID: cover.1684154817.git.asml.silence@gmail.com (mailing list archive)
Series: Enable IOU_F_TWQ_LAZY_WAKE for passthrough
On Mon, May 15, 2023 at 6:29 PM Pavel Begunkov <asml.silence@gmail.com> wrote:
>
> Let cmds use IOU_F_TWQ_LAZY_WAKE and enable it for nvme passthrough.
>
> The result should be the same as in the tests of the original
> IOU_F_TWQ_LAZY_WAKE [1] patchset, but for a quick test I took
> fio/t/io_uring with 4 threads, each reading its own drive and all
> pinned to the same CPU to make it CPU bound, and got a +10% throughput
> improvement.
>
> [1] https://lore.kernel.org/all/cover.1680782016.git.asml.silence@gmail.com/
>
> Pavel Begunkov (2):
>   io_uring/cmd: add cmd lazy tw wake helper
>   nvme: optimise io_uring passthrough completion
>
>  drivers/nvme/host/ioctl.c |  4 ++--
>  include/linux/io_uring.h  | 18 ++++++++++++++++--
>  io_uring/uring_cmd.c      | 16 ++++++++++++----
>  3 files changed, 30 insertions(+), 8 deletions(-)
>
>
> base-commit: 9a48d604672220545d209e9996c2a1edbb5637f6
> --
> 2.40.0
>

I tried to run a few workloads on my setup with your patches applied.
However, I couldn't see any difference in io passthrough performance. I
might have missed something. Can you share the workload you ran that
gave you the perf improvement? Here is the workload that I ran:

Without your patches applied:

# taskset -c 0 t/io_uring -r4 -b512 -d64 -c16 -s16 -p0 -F1 -B1 -P0 -O0 -u1 -n1 /dev/ng0n1
submitter=0, tid=2049, file=/dev/ng0n1, node=-1
polled=0, fixedbufs=1/0, register_files=1, buffered=1, QD=64
Engine=io_uring, sq_ring=64, cq_ring=64
IOPS=2.83M, BW=1382MiB/s, IOS/call=16/15
IOPS=2.82M, BW=1379MiB/s, IOS/call=16/16
IOPS=2.84M, BW=1388MiB/s, IOS/call=16/15
Exiting on timeout
Maximum IOPS=2.84M

# taskset -c 0,3 t/io_uring -r4 -b512 -d64 -c16 -s16 -p0 -F1 -B1 -P0 -O0 -u1 -n2 /dev/ng0n1 /dev/ng1n1
submitter=0, tid=2046, file=/dev/ng0n1, node=-1
submitter=1, tid=2047, file=/dev/ng1n1, node=-1
polled=0, fixedbufs=1/0, register_files=1, buffered=1, QD=64
Engine=io_uring, sq_ring=64, cq_ring=64
IOPS=5.72M, BW=2.79GiB/s, IOS/call=16/15
IOPS=5.71M, BW=2.79GiB/s, IOS/call=16/16
IOPS=5.70M, BW=2.78GiB/s, IOS/call=16/15
Exiting on timeout
Maximum IOPS=5.72M

With your patches applied:

# taskset -c 0 t/io_uring -r4 -b512 -d64 -c16 -s16 -p0 -F1 -B1 -P0 -O0 -u1 -n1 /dev/ng0n1
submitter=0, tid=2032, file=/dev/ng0n1, node=-1
polled=0, fixedbufs=1/0, register_files=1, buffered=1, QD=64
Engine=io_uring, sq_ring=64, cq_ring=64
IOPS=2.83M, BW=1381MiB/s, IOS/call=16/15
IOPS=2.83M, BW=1379MiB/s, IOS/call=16/15
IOPS=2.83M, BW=1383MiB/s, IOS/call=15/15
Exiting on timeout
Maximum IOPS=2.83M

# taskset -c 0,3 t/io_uring -r4 -b512 -d64 -c16 -s16 -p0 -F1 -B1 -P0 -O0 -u1 -n2 /dev/ng0n1 /dev/ng1n1
submitter=1, tid=2037, file=/dev/ng1n1, node=-1
submitter=0, tid=2036, file=/dev/ng0n1, node=-1
polled=0, fixedbufs=1/0, register_files=1, buffered=1, QD=64
Engine=io_uring, sq_ring=64, cq_ring=64
IOPS=5.64M, BW=2.75GiB/s, IOS/call=15/15
IOPS=5.62M, BW=2.75GiB/s, IOS/call=16/16
IOPS=5.62M, BW=2.74GiB/s, IOS/call=16/16
Exiting on timeout
Maximum IOPS=5.64M

--
Anuj Gupta
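For context on what the series changes, the [1/2] helper plausibly has
the shape sketched below. This is inferred from the patch title and
diffstat, not the literal diff; the name io_uring_cmd_do_in_task_lazy
and the __io_uring_cmd_do_in_task() internal are assumptions.

/*
 * Sketch of the [1/2] helper (include/linux/io_uring.h), assuming the
 * patch adds a lazy variant of io_uring_cmd_complete_in_task(); names
 * here are guesses from the patch title, not the real diff.
 */
static inline void io_uring_cmd_do_in_task_lazy(struct io_uring_cmd *ioucmd,
			void (*task_work_cb)(struct io_uring_cmd *, unsigned))
{
	/*
	 * Queue the callback as task_work, but tag it IOU_F_TWQ_LAZY_WAKE
	 * so the submitter task is only woken once enough completions have
	 * accumulated to satisfy its CQ wait count.
	 */
	__io_uring_cmd_do_in_task(ioucmd, task_work_cb, IOU_F_TWQ_LAZY_WAKE);
}

A driver opting in would simply swap its completion call to the lazy
variant and accept that the wakeup may be deferred.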
On 5/16/23 12:42, Anuj gupta wrote:
> On Mon, May 15, 2023 at 6:29 PM Pavel Begunkov <asml.silence@gmail.com> wrote:
>>
>> Let cmds use IOU_F_TWQ_LAZY_WAKE and enable it for nvme passthrough.
>>
>> The result should be the same as in the tests of the original
>> IOU_F_TWQ_LAZY_WAKE [1] patchset, but for a quick test I took
>> fio/t/io_uring with 4 threads, each reading its own drive and all
>> pinned to the same CPU to make it CPU bound, and got a +10% throughput
>> improvement.
>>
>> [1] https://lore.kernel.org/all/cover.1680782016.git.asml.silence@gmail.com/
>>
>> Pavel Begunkov (2):
>>   io_uring/cmd: add cmd lazy tw wake helper
>>   nvme: optimise io_uring passthrough completion
>>
>>  drivers/nvme/host/ioctl.c |  4 ++--
>>  include/linux/io_uring.h  | 18 ++++++++++++++++--
>>  io_uring/uring_cmd.c      | 16 ++++++++++++----
>>  3 files changed, 30 insertions(+), 8 deletions(-)
>>
>>
>> base-commit: 9a48d604672220545d209e9996c2a1edbb5637f6
>> --
>> 2.40.0
>>
>
> I tried to run a few workloads on my setup with your patches applied.
> However, I couldn't see any difference in io passthrough performance. I
> might have missed something. Can you share the workload you ran that
> gave you the perf improvement? Here is the workload that I ran:

The patch is a way to make completion batching more consistent. If
you're lucky enough that all IOs complete before task_work runs,
batching is already perfect and there is nothing to improve. That often
happens with high-throughput benchmarks because of how consistent they
are: no writes, same size, everything issued at the same time, and so
on. In reality it depends on your usage pattern, timings and nvme
coalescing, and it will also change if you introduce a second drive,
and so on.

With the patch, t/io_uring should run task_work once for exactly the
number of CQEs the user is waiting for, i.e. -c<N>, regardless of
circumstances. Just tried it out to confirm:

taskset -c 0 nice -n -20 /t/io_uring -p0 -d4 -b8192 -s4 -c4 -F1 -B1 -R0 -X1 -u1 -O0 /dev/ng0n1

Without:

12:11:10 PM  CPU   %usr  %nice   %sys  %iowait  %irq  %soft  %steal  %guest  %gnice  %idle
12:11:20 PM    0   2.03   0.00  25.95     0.00  0.00   0.00    0.00    0.00    0.00  72.03

With:

12:12:00 PM  CPU   %usr  %nice   %sys  %iowait  %irq  %soft  %steal  %guest  %gnice  %idle
12:12:10 PM    0   2.22   0.00  17.39     0.00  0.00   0.00    0.00    0.00    0.00  80.40

Double-checking it works:

echo 1 > /sys/kernel/debug/tracing/events/io_uring/io_uring_local_work_run/enable
cat /sys/kernel/debug/tracing/trace_pipe

Without the patches I see:

io_uring-4108 [000] ..... 653.820369: io_uring_local_work_run: ring 00000000b843f57f, count 1, loops 1
io_uring-4108 [000] ..... 653.820371: io_uring_local_work_run: ring 00000000b843f57f, count 1, loops 1
io_uring-4108 [000] ..... 653.820382: io_uring_local_work_run: ring 00000000b843f57f, count 2, loops 1
io_uring-4108 [000] ..... 653.820383: io_uring_local_work_run: ring 00000000b843f57f, count 1, loops 1
io_uring-4108 [000] ..... 653.820386: io_uring_local_work_run: ring 00000000b843f57f, count 1, loops 1
io_uring-4108 [000] ..... 653.820398: io_uring_local_work_run: ring 00000000b843f57f, count 2, loops 1
io_uring-4108 [000] ..... 653.820398: io_uring_local_work_run: ring 00000000b843f57f, count 1, loops 1

And with the patches it's strictly count=4.

Another way would be to add more SSDs to the picture and hope they
don't conspire to complete at the same time.
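The batching rule Pavel describes can be restated as a runnable toy
model (userspace C, emphatically not the kernel implementation): with
an eager wake every queued completion wakes the waiter, while a lazy
wake fires only once the queued count reaches the waiter's -c<N>
target.

#include <stdio.h>

/*
 * Toy model of IOU_F_TWQ_LAZY_WAKE: completions queued as task_work
 * only wake the waiting task once enough have accumulated to satisfy
 * the number of CQEs it is waiting for (t/io_uring's -c<N>).
 */
struct toy_ring {
	int queued;	/* task_work items (pending CQEs) queued so far */
	int cq_wait_nr;	/* CQEs the waiter asked for, e.g. -c4 */
};

/* Returns 1 if this completion should wake the waiter. */
static int queue_completion(struct toy_ring *ring, int lazy)
{
	ring->queued++;
	if (!lazy)
		return 1;	/* eager: wake on every completion */
	/* lazy: wake only once the whole batch is there */
	return ring->queued >= ring->cq_wait_nr;
}

int main(void)
{
	struct toy_ring ring = { .queued = 0, .cq_wait_nr = 4 };
	int i;

	for (i = 1; i <= 4; i++) {
		int wake = queue_completion(&ring, 1);

		printf("completion %d: %s\n", i,
		       wake ? "wake waiter, one task_work run for the batch"
			    : "defer");
	}
	return 0;
}

Run with -c4, the model defers three times and wakes on the fourth
completion, matching the strictly count=4 trace above.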
On Mon, 15 May 2023 13:54:41 +0100, Pavel Begunkov wrote:
> Let cmds use IOU_F_TWQ_LAZY_WAKE and enable it for nvme passthrough.
>
> The result should be the same as in the tests of the original
> IOU_F_TWQ_LAZY_WAKE [1] patchset, but for a quick test I took
> fio/t/io_uring with 4 threads, each reading its own drive and all
> pinned to the same CPU to make it CPU bound, and got a +10% throughput
> improvement.
>
> [...]

Applied, thanks!

[1/2] io_uring/cmd: add cmd lazy tw wake helper
      commit: 5f3139fc46993b2d653a7aa5cdfe66a91881fd06
[2/2] nvme: optimise io_uring passthrough completion
      commit: f026be0e1e881e3395c3d5418ffc8c2a2203c3f3

Best regards,
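For reference, the applied [2/2] change presumably boils down to
switching nvme's uring_cmd completion path from the eager task_work
helper to the lazy one. The sketch below abbreviates the driver's
result handling; the function and callback names are taken from the
mainline driver of that period and may differ in detail.

/*
 * Sketch of the [2/2] change in drivers/nvme/host/ioctl.c (abbreviated,
 * names assumed): defer CQE posting to task context via the lazy helper
 * so io_uring can batch the wakeup.
 */
static enum rq_end_io_ret nvme_uring_cmd_end_io(struct request *req,
						blk_status_t err)
{
	struct io_uring_cmd *ioucmd = req->end_io_data;

	/* driver-specific result capture elided */

	/* was: io_uring_cmd_complete_in_task(ioucmd, nvme_uring_task_cb); */
	io_uring_cmd_do_in_task_lazy(ioucmd, nvme_uring_task_cb);
	return RQ_END_IO_NONE;
}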