
[RFC,v2,00/13] CQ waiting and wake up optimisations

Message ID cover.1672713341.git.asml.silence@gmail.com

Message

Pavel Begunkov Jan. 3, 2023, 3:03 a.m. UTC
The series replaces waitqueues for CQ waiting with a custom waiting
loop and adds a couple more perf tweaks around it. Benchmarking was
done for QD1 with simulated tw arrival right after we start waiting;
it gets us from 7.5 MIOPS to 9.2 MIOPS, which is +22%, or double that
if counting only the in-kernel io_uring overhead (i.e. excluding
syscall and userspace time). That matches the profiles: wake_up()
_without_ wake_up_state() was taking 12-14% and
prepare_to_wait_exclusive() was around 4-6%.
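
To make the shape of the change concrete, here is a minimal sketch of
the idea under simplified assumptions; the my_ring type and the
my_cq_wait()/my_cq_post() helpers below are made up for illustration
and are not the actual io_uring code. With a single waiter, the
waiting task publishes its task_struct and sleeps, and the completion
side wakes it directly with wake_up_state() instead of going through
prepare_to_wait_exclusive() and a waitqueue walk.

#include <linux/atomic.h>
#include <linux/compiler.h>
#include <linux/sched.h>
#include <linux/sched/signal.h>

struct my_ring {
        struct task_struct *cq_waiter;  /* single waiting task, or NULL */
        atomic_t cq_tail;               /* completions posted so far */
};

/* waiter side: sleep until @min_events new completions have been posted */
static void my_cq_wait(struct my_ring *ring, int seen, int min_events)
{
        WRITE_ONCE(ring->cq_waiter, current);

        for (;;) {
                /* implies a full barrier, pairing with my_cq_post() */
                set_current_state(TASK_INTERRUPTIBLE);
                if (atomic_read(&ring->cq_tail) - seen >= min_events)
                        break;
                if (signal_pending(current))
                        break;
                schedule();
        }
        __set_current_state(TASK_RUNNING);
        WRITE_ONCE(ring->cq_waiter, NULL);
}

/* completion side: post a completion and wake the waiter directly, if any */
static void my_cq_post(struct my_ring *ring)
{
        struct task_struct *tsk;

        atomic_inc(&ring->cq_tail);
        /* order the tail update before reading the waiter pointer */
        smp_mb__after_atomic();
        tsk = READ_ONCE(ring->cq_waiter);
        if (tsk)
                wake_up_state(tsk, TASK_INTERRUPTIBLE);
}

The real series additionally has to handle multiple wait conditions,
CQ overflow and ring polling, which is roughly what the later patches
in the list below deal with.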

The extra 15% reported in v1 is no longer there, as it has since been
optimised away by 52ea806ad9834 ("io_uring: finish waiting before
flushing overflow entries"). So, compared to a couple of weeks ago,
the performance of this test case should have jumped by more than 30%
end-to-end (again, with only half of the cycles spent in io_uring
kernel code).
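
For illustration, a hedged sketch of the general pattern that commit
title describes, written with generic waitqueue helpers rather than
the real io_uring functions (done(), flush_overflow() and the ring
argument are placeholders): leave the waitqueue and restore
TASK_RUNNING via finish_wait() first, and only then do the potentially
expensive overflow flush.

#include <linux/sched.h>
#include <linux/sched/signal.h>
#include <linux/wait.h>

static void wait_then_flush(struct wait_queue_head *wq,
                            bool (*done)(void *),
                            void (*flush_overflow)(void *), void *ring)
{
        DEFINE_WAIT(wait);

        for (;;) {
                prepare_to_wait(wq, &wait, TASK_INTERRUPTIBLE);
                if (done(ring) || signal_pending(current))
                        break;
                schedule();
        }

        /* dequeue from the waitqueue and go back to TASK_RUNNING first... */
        finish_wait(wq, &wait);
        /* ...and only then flush the overflowed entries */
        flush_overflow(ring);
}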

Patches 1-8 are preparation patches and might be taken right away. The
rest need more comments and maybe a little polishing.

Pavel Begunkov (13):
  io_uring: rearrange defer list checks
  io_uring: don't iterate cq wait fast path
  io_uring: kill io_run_task_work_ctx
  io_uring: move defer tw task checks
  io_uring: parse check_cq out of wq waiting
  io_uring: mimimise io_cqring_wait_schedule
  io_uring: simplify io_has_work
  io_uring: set TASK_RUNNING right after schedule
  io_uring: separate wq for ring polling
  io_uring: add lazy poll_wq activation
  io_uring: wake up optimisations
  io_uring: waitqueue-less cq waiting
  io_uring: add io_req_local_work_add wake fast path

 include/linux/io_uring_types.h |   4 +
 io_uring/io_uring.c            | 194 +++++++++++++++++++++++----------
 io_uring/io_uring.h            |  35 +++---
 3 files changed, 155 insertions(+), 78 deletions(-)

Comments

Jens Axboe Jan. 4, 2023, 6:05 p.m. UTC | #1
On Tue, 03 Jan 2023 03:03:51 +0000, Pavel Begunkov wrote:
> The series replaces waitqueues for CQ waiting with a custom waiting
> loop and adds a couple more perf tweaks around it. Benchmarking was
> done for QD1 with simulated tw arrival right after we start waiting;
> it gets us from 7.5 MIOPS to 9.2 MIOPS, which is +22%, or double that
> if counting only the in-kernel io_uring overhead (i.e. excluding
> syscall and userspace time). That matches the profiles: wake_up()
> _without_ wake_up_state() was taking 12-14% and
> prepare_to_wait_exclusive() was around 4-6%.
> 
> [...]

Applied, thanks!

[01/13] io_uring: rearrange defer list checks
        commit: 9617404e5d86e9cfb2da4ac2b17e99a72836bf69
[02/13] io_uring: don't iterate cq wait fast path
        commit: 1329dc7e79da3570f6591d9997bd2fe3a7d17ca6
[03/13] io_uring: kill io_run_task_work_ctx
        commit: 90b8457304e25a137c1b8c89f7cae276b79d3273
[04/13] io_uring: move defer tw task checks
        commit: 1345a6b381b4d39b15a1e34c0a78be2ee2e452c6
[05/13] io_uring: parse check_cq out of wq waiting
        commit: b5be9ebe91246b67d4b0dee37e3071d73ba69119
[06/13] io_uring: mimimise io_cqring_wait_schedule
        commit: de254b5029fa37c4e0a6a16743fa2271fa524fc7
[07/13] io_uring: simplify io_has_work
        commit: 26736d171ec54487de677f09d682d144489957fa
[08/13] io_uring: set TASK_RUNNING right after schedule
        commit: 8214ccccf64f1335b34b98ed7deb2c6c29969c49

Best regards,
Pavel Begunkov Jan. 4, 2023, 8:25 p.m. UTC | #2
On 1/3/23 03:03, Pavel Begunkov wrote:
> The series replaces waitqueues for CQ waiting with a custom waiting
> loop and adds a couple more perf tweaks around it. Benchmarking was
> done for QD1 with simulated tw arrival right after we start waiting;
> it gets us from 7.5 MIOPS to 9.2 MIOPS, which is +22%, or double that
> if counting only the in-kernel io_uring overhead (i.e. excluding
> syscall and userspace time). That matches the profiles: wake_up()
> _without_ wake_up_state() was taking 12-14% and
> prepare_to_wait_exclusive() was around 4-6%.

The numbers above were gathered with an in-kernel trick. I tried to
quickly measure without it:

modprobe null_blk no_sched=1 irqmode=2 completion_nsec=0
taskset -c 0 fio/t/io_uring -d1 -s1 -c1 -p0 -B1 -F1 -X -b512 -n4 /dev/nullb0

The important part here is using the timer-backed null_blk and pinning
multiple workers to a single CPU. -n4 was enough for me to keep the
CPU 100% busy.

old:
IOPS=539.51K, BW=2.11GiB/s, IOS/call=1/1
IOPS=542.26K, BW=2.12GiB/s, IOS/call=1/1
IOPS=540.73K, BW=2.11GiB/s, IOS/call=1/1
IOPS=541.28K, BW=2.11GiB/s, IOS/call=0/0

new:
IOPS=561.85K, BW=2.19GiB/s, IOS/call=1/1
IOPS=561.58K, BW=2.19GiB/s, IOS/call=1/1
IOPS=561.56K, BW=2.19GiB/s, IOS/call=1/1
IOPS=559.94K, BW=2.19GiB/s, IOS/call=1/1

The difference is only ~3.5% because of the large additional overhead
from the null_blk timers, block QoS and other unnecessary bits.

P.S. Tested with an out-of-tree patch that adds a flag for
enabling/disabling the feature, to remove variance between reboots.