Message ID: cover.1684154817.git.asml.silence@gmail.com (mailing list archive)
Series: Enable IOU_F_TWQ_LAZY_WAKE for passthrough
On Mon, May 15, 2023 at 6:29 PM Pavel Begunkov <asml.silence@gmail.com> wrote:
>
> Let cmds use IOU_F_TWQ_LAZY_WAKE and enable it for nvme passthrough.
>
> The result should be the same as in the tests of the original
> IOU_F_TWQ_LAZY_WAKE [1] patchset, but for a quick test I took
> fio/t/io_uring with 4 threads, each reading its own drive and all
> pinned to the same CPU to make it CPU bound, and got a +10% throughput
> improvement.
>
> [1] https://lore.kernel.org/all/cover.1680782016.git.asml.silence@gmail.com/
>
> Pavel Begunkov (2):
>   io_uring/cmd: add cmd lazy tw wake helper
>   nvme: optimise io_uring passthrough completion
>
>  drivers/nvme/host/ioctl.c |  4 ++--
>  include/linux/io_uring.h  | 18 ++++++++++++++++--
>  io_uring/uring_cmd.c      | 16 ++++++++++++----
>  3 files changed, 30 insertions(+), 8 deletions(-)
>
>
> base-commit: 9a48d604672220545d209e9996c2a1edbb5637f6
> --
> 2.40.0
>

I tried to run a few workloads on my setup with your patches applied.
However, I couldn't see any difference in io passthrough performance. I
might have missed something. Can you share the workload you ran that
gave you the perf improvement? Here is the workload that I ran:

Without your patches applied:

# taskset -c 0 t/io_uring -r4 -b512 -d64 -c16 -s16 -p0 -F1 -B1 -P0 -O0 -u1 -n1 /dev/ng0n1
submitter=0, tid=2049, file=/dev/ng0n1, node=-1
polled=0, fixedbufs=1/0, register_files=1, buffered=1, QD=64
Engine=io_uring, sq_ring=64, cq_ring=64
IOPS=2.83M, BW=1382MiB/s, IOS/call=16/15
IOPS=2.82M, BW=1379MiB/s, IOS/call=16/16
IOPS=2.84M, BW=1388MiB/s, IOS/call=16/15
Exiting on timeout
Maximum IOPS=2.84M

# taskset -c 0,3 t/io_uring -r4 -b512 -d64 -c16 -s16 -p0 -F1 -B1 -P0 -O0 -u1 -n2 /dev/ng0n1 /dev/ng1n1
submitter=0, tid=2046, file=/dev/ng0n1, node=-1
submitter=1, tid=2047, file=/dev/ng1n1, node=-1
polled=0, fixedbufs=1/0, register_files=1, buffered=1, QD=64
Engine=io_uring, sq_ring=64, cq_ring=64
IOPS=5.72M, BW=2.79GiB/s, IOS/call=16/15
IOPS=5.71M, BW=2.79GiB/s, IOS/call=16/16
IOPS=5.70M, BW=2.78GiB/s, IOS/call=16/15
Exiting on timeout
Maximum IOPS=5.72M

With your patches applied:

# taskset -c 0 t/io_uring -r4 -b512 -d64 -c16 -s16 -p0 -F1 -B1 -P0 -O0 -u1 -n1 /dev/ng0n1
submitter=0, tid=2032, file=/dev/ng0n1, node=-1
polled=0, fixedbufs=1/0, register_files=1, buffered=1, QD=64
Engine=io_uring, sq_ring=64, cq_ring=64
IOPS=2.83M, BW=1381MiB/s, IOS/call=16/15
IOPS=2.83M, BW=1379MiB/s, IOS/call=16/15
IOPS=2.83M, BW=1383MiB/s, IOS/call=15/15
Exiting on timeout
Maximum IOPS=2.83M

# taskset -c 0,3 t/io_uring -r4 -b512 -d64 -c16 -s16 -p0 -F1 -B1 -P0 -O0 -u1 -n2 /dev/ng0n1 /dev/ng1n1
submitter=1, tid=2037, file=/dev/ng1n1, node=-1
submitter=0, tid=2036, file=/dev/ng0n1, node=-1
polled=0, fixedbufs=1/0, register_files=1, buffered=1, QD=64
Engine=io_uring, sq_ring=64, cq_ring=64
IOPS=5.64M, BW=2.75GiB/s, IOS/call=15/15
IOPS=5.62M, BW=2.75GiB/s, IOS/call=16/16
IOPS=5.62M, BW=2.74GiB/s, IOS/call=16/16
Exiting on timeout
Maximum IOPS=5.64M

--
Anuj Gupta
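For context on what the series changes, the [1/2] helper plausibly has
the shape sketched below. This is inferred from the patch title and
diffstat, not the literal diff; the name io_uring_cmd_do_in_task_lazy
and the __io_uring_cmd_do_in_task() internal are assumptions.

/*
 * Sketch of the [1/2] helper (include/linux/io_uring.h), assuming the
 * patch adds a lazy variant of io_uring_cmd_complete_in_task(); names
 * here are guesses from the patch title, not the real diff.
 */
static inline void io_uring_cmd_do_in_task_lazy(struct io_uring_cmd *ioucmd,
			void (*task_work_cb)(struct io_uring_cmd *, unsigned))
{
	/*
	 * Queue the callback as task_work, but tag it IOU_F_TWQ_LAZY_WAKE
	 * so the submitter task is only woken once enough completions have
	 * accumulated to satisfy its CQ wait count.
	 */
	__io_uring_cmd_do_in_task(ioucmd, task_work_cb, IOU_F_TWQ_LAZY_WAKE);
}

A driver opting in would simply swap its completion call to the lazy
variant and accept that the wakeup may be deferred.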
On 5/16/23 12:42, Anuj gupta wrote:
> On Mon, May 15, 2023 at 6:29 PM Pavel Begunkov <asml.silence@gmail.com> wrote:
>>
>> Let cmds use IOU_F_TWQ_LAZY_WAKE and enable it for nvme passthrough.
>>
>> The result should be the same as in the tests of the original
>> IOU_F_TWQ_LAZY_WAKE [1] patchset, but for a quick test I took
>> fio/t/io_uring with 4 threads, each reading its own drive and all
>> pinned to the same CPU to make it CPU bound, and got a +10% throughput
>> improvement.
>>
>> [1] https://lore.kernel.org/all/cover.1680782016.git.asml.silence@gmail.com/
>>
>> Pavel Begunkov (2):
>>   io_uring/cmd: add cmd lazy tw wake helper
>>   nvme: optimise io_uring passthrough completion
>>
>>  drivers/nvme/host/ioctl.c |  4 ++--
>>  include/linux/io_uring.h  | 18 ++++++++++++++++--
>>  io_uring/uring_cmd.c      | 16 ++++++++++++----
>>  3 files changed, 30 insertions(+), 8 deletions(-)
>>
>>
>> base-commit: 9a48d604672220545d209e9996c2a1edbb5637f6
>> --
>> 2.40.0
>>
>
> I tried to run a few workloads on my setup with your patches applied.
> However, I couldn't see any difference in io passthrough performance. I
> might have missed something. Can you share the workload you ran that
> gave you the perf improvement? Here is the workload that I ran:

The patch is a way to make completion batching more consistent. If
you're lucky enough that all IOs complete before task_work runs,
batching is already perfect and there is nothing to improve. That often
happens with high-throughput benchmarks because of how consistent they
are: no writes, same size, everything issued at the same time, and so
on. In reality it depends on your usage pattern, timings and nvme
coalescing, and it will also change if you introduce a second drive,
and so on.

With the patch, t/io_uring should run task_work once for exactly the
number of CQEs the user is waiting for, i.e. -c<N>, regardless of
circumstances. Just tried it out to confirm:

taskset -c 0 nice -n -20 /t/io_uring -p0 -d4 -b8192 -s4 -c4 -F1 -B1 -R0 -X1 -u1 -O0 /dev/ng0n1

Without:

12:11:10 PM  CPU   %usr  %nice   %sys  %iowait  %irq  %soft  %steal  %guest  %gnice  %idle
12:11:20 PM    0   2.03   0.00  25.95     0.00  0.00   0.00    0.00    0.00    0.00  72.03

With:

12:12:00 PM  CPU   %usr  %nice   %sys  %iowait  %irq  %soft  %steal  %guest  %gnice  %idle
12:12:10 PM    0   2.22   0.00  17.39     0.00  0.00   0.00    0.00    0.00    0.00  80.40

Double-checking it works:

echo 1 > /sys/kernel/debug/tracing/events/io_uring/io_uring_local_work_run/enable
cat /sys/kernel/debug/tracing/trace_pipe

Without the patches I see:

io_uring-4108 [000] ..... 653.820369: io_uring_local_work_run: ring 00000000b843f57f, count 1, loops 1
io_uring-4108 [000] ..... 653.820371: io_uring_local_work_run: ring 00000000b843f57f, count 1, loops 1
io_uring-4108 [000] ..... 653.820382: io_uring_local_work_run: ring 00000000b843f57f, count 2, loops 1
io_uring-4108 [000] ..... 653.820383: io_uring_local_work_run: ring 00000000b843f57f, count 1, loops 1
io_uring-4108 [000] ..... 653.820386: io_uring_local_work_run: ring 00000000b843f57f, count 1, loops 1
io_uring-4108 [000] ..... 653.820398: io_uring_local_work_run: ring 00000000b843f57f, count 2, loops 1
io_uring-4108 [000] ..... 653.820398: io_uring_local_work_run: ring 00000000b843f57f, count 1, loops 1

And with the patches it's strictly count=4.

Another way would be to add more SSDs to the picture and hope they
don't conspire to complete at the same time.
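The batching rule Pavel describes can be restated as a runnable toy
model (userspace C, emphatically not the kernel implementation): with
an eager wake every queued completion wakes the waiter, while a lazy
wake fires only once the queued count reaches the waiter's -c<N>
target.

#include <stdio.h>

/*
 * Toy model of IOU_F_TWQ_LAZY_WAKE: completions queued as task_work
 * only wake the waiting task once enough have accumulated to satisfy
 * the number of CQEs it is waiting for (t/io_uring's -c<N>).
 */
struct toy_ring {
	int queued;	/* task_work items (pending CQEs) queued so far */
	int cq_wait_nr;	/* CQEs the waiter asked for, e.g. -c4 */
};

/* Returns 1 if this completion should wake the waiter. */
static int queue_completion(struct toy_ring *ring, int lazy)
{
	ring->queued++;
	if (!lazy)
		return 1;	/* eager: wake on every completion */
	/* lazy: wake only once the whole batch is there */
	return ring->queued >= ring->cq_wait_nr;
}

int main(void)
{
	struct toy_ring ring = { .queued = 0, .cq_wait_nr = 4 };
	int i;

	for (i = 1; i <= 4; i++) {
		int wake = queue_completion(&ring, 1);

		printf("completion %d: %s\n", i,
		       wake ? "wake waiter, one task_work run for the batch"
			    : "defer");
	}
	return 0;
}

Run with -c4, the model defers three times and wakes on the fourth
completion, matching the strictly count=4 trace above.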
On Mon, 15 May 2023 13:54:41 +0100, Pavel Begunkov wrote:
> Let cmds use IOU_F_TWQ_LAZY_WAKE and enable it for nvme passthrough.
>
> The result should be the same as in the tests of the original
> IOU_F_TWQ_LAZY_WAKE [1] patchset, but for a quick test I took
> fio/t/io_uring with 4 threads, each reading its own drive and all
> pinned to the same CPU to make it CPU bound, and got a +10% throughput
> improvement.
>
> [...]

Applied, thanks!

[1/2] io_uring/cmd: add cmd lazy tw wake helper
      commit: 5f3139fc46993b2d653a7aa5cdfe66a91881fd06
[2/2] nvme: optimise io_uring passthrough completion
      commit: f026be0e1e881e3395c3d5418ffc8c2a2203c3f3

Best regards,
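For reference, the applied [2/2] change presumably boils down to
switching nvme's uring_cmd completion path from the eager task_work
helper to the lazy one. The sketch below abbreviates the driver's
result handling; the function and callback names are taken from the
mainline driver of that period and may differ in detail.

/*
 * Sketch of the [2/2] change in drivers/nvme/host/ioctl.c (abbreviated,
 * names assumed): defer CQE posting to task context via the lazy helper
 * so io_uring can batch the wakeup.
 */
static enum rq_end_io_ret nvme_uring_cmd_end_io(struct request *req,
						blk_status_t err)
{
	struct io_uring_cmd *ioucmd = req->end_io_data;

	/* driver-specific result capture elided */

	/* was: io_uring_cmd_complete_in_task(ioucmd, nvme_uring_task_cb); */
	io_uring_cmd_do_in_task_lazy(ioucmd, nvme_uring_task_cb);
	return RQ_END_IO_NONE;
}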