[00/16] block optimisation round

Message ID: cover.1634676157.git.asml.silence@gmail.com

Message

Pavel Begunkov Oct. 19, 2021, 9:24 p.m. UTC
Jens tried out a similar series with some additions not yet sent out:
8.2-8.3 MIOPS -> ~9 MIOPS, or 8-10%.

12/16 is bulky, but it nicely drives the numbers. Moreover, with
it we can get rid of some no-longer-needed optimisations in
__blkdev_direct_IO(), because it always serves multiple bios.
E.g. there is no need for conditional referencing with DIO_MULTI_BIO,
and it can _probably_ be converted to chained bios.

Pavel Begunkov (16):
  block: turn macro helpers into inline functions
  block: convert leftovers to bdev_get_queue
  block: optimise req_bio_endio()
  block: don't bloat enter_queue with percpu_ref
  block: inline a part of bio_release_pages()
  block: clean up blk_mq_submit_bio() merging
  blocK: move plug flush functions to blk-mq.c
  block: optimise blk_flush_plug_list
  block: optimise boundary blkdev_read_iter's checks
  block: optimise blkdev_bio_end_io()
  block: add optimised version bio_set_dev()
  block: add single bio async direct IO helper
  block: add async version of bio_set_polled
  block: skip advance when async and not needed
  block: optimise blk_may_split for normal rw
  block: optimise submit_bio_checks for normal rw

 block/bio.c            |  20 +++----
 block/blk-core.c       | 105 ++++++++++++++--------------------
 block/blk-merge.c      |   2 +-
 block/blk-mq-sched.c   |   2 +-
 block/blk-mq-sched.h   |  12 +---
 block/blk-mq.c         |  64 ++++++++++++++-------
 block/blk-mq.h         |   1 -
 block/blk.h            |  20 ++++---
 block/fops.c           | 125 ++++++++++++++++++++++++++++++++++-------
 include/linux/bio.h    |  60 ++++++++++++++------
 include/linux/blk-mq.h |   2 -
 11 files changed, 259 insertions(+), 154 deletions(-)

Comments

Jens Axboe Oct. 19, 2021, 11:31 p.m. UTC | #1
On 10/19/21 3:24 PM, Pavel Begunkov wrote:
> Jens tried out a similar series with some additions not yet sent out:
> 8.2-8.3 MIOPS -> ~9 MIOPS, or 8-10%.
> 
> 12/16 is bulky, but it nicely drives the numbers. Moreover, with
> it we can get rid of some no-longer-needed optimisations in
> __blkdev_direct_IO(), because it always serves multiple bios.
> E.g. there is no need for conditional referencing with DIO_MULTI_BIO,
> and it can _probably_ be converted to chained bios.

These look good to me, only the trivial comment that patch 7
has a blocK instead of block.

I'll await further comments/reviews on these.
Jens Axboe Oct. 20, 2021, 12:21 a.m. UTC | #2
On Tue, 19 Oct 2021 22:24:09 +0100, Pavel Begunkov wrote:
> Jens tried out a similar series with some additions not yet sent out:
> 8.2-8.3 MIOPS -> ~9 MIOPS, or 8-10%.
> 
> 12/16 is bulky, but it nicely drives the numbers. Moreover, with
> it we can get rid of some no-longer-needed optimisations in
> __blkdev_direct_IO(), because it always serves multiple bios.
> E.g. there is no need for conditional referencing with DIO_MULTI_BIO,
> and it can _probably_ be converted to chained bios.
> 
> [...]

Applied, thanks!

[01/16] block: turn macro helpers into inline functions
        (no commit info)
[02/16] block: convert leftovers to bdev_get_queue
        (no commit info)
[03/16] block: optimise req_bio_endio()
        (no commit info)
[04/16] block: don't bloat enter_queue with percpu_ref
        (no commit info)
[05/16] block: inline a part of bio_release_pages()
        (no commit info)
[06/16] block: clean up blk_mq_submit_bio() merging
        (no commit info)
[07/16] blocK: move plug flush functions to blk-mq.c
        (no commit info)
[08/16] block: optimise blk_flush_plug_list
        (no commit info)
[09/16] block: optimise boundary blkdev_read_iter's checks
        (no commit info)
[10/16] block: optimise blkdev_bio_end_io()
        (no commit info)
[11/16] block: add optimised version bio_set_dev()
        (no commit info)
[12/16] block: add single bio async direct IO helper
        (no commit info)
[13/16] block: add async version of bio_set_polled
        (no commit info)
[14/16] block: skip advance when async and not needed
        (no commit info)
[15/16] block: optimise blk_may_split for normal rw
        (no commit info)
[16/16] block: optimise submit_bio_checks for normal rw
        (no commit info)

Best regards,
Jens Axboe Oct. 20, 2021, 12:22 a.m. UTC | #3
On 10/19/21 6:21 PM, Jens Axboe wrote:
> On Tue, 19 Oct 2021 22:24:09 +0100, Pavel Begunkov wrote:
>> Jens tried out a similar series with some additions not yet sent out:
>> 8.2-8.3 MIOPS -> ~9 MIOPS, or 8-10%.
>>
>> 12/16 is bulky, but it nicely drives the numbers. Moreover, with
>> it we can get rid of some no-longer-needed optimisations in
>> __blkdev_direct_IO(), because it always serves multiple bios.
>> E.g. there is no need for conditional referencing with DIO_MULTI_BIO,
>> and it can _probably_ be converted to chained bios.
>>
>> [...]
> 
> Applied, thanks!
> 
> [...]

Sorry, b4 got too eager on that one, they are not applied just yet.
Jens Axboe Oct. 20, 2021, 2:12 p.m. UTC | #4
On Tue, 19 Oct 2021 22:24:09 +0100, Pavel Begunkov wrote:
> Jens tried out a similar series with some additions not yet sent out:
> 8.2-8.3 MIOPS -> ~9 MIOPS, or 8-10%.
> 
> 12/16 is bulky, but it nicely drives the numbers. Moreover, with
> it we can get rid of some no-longer-needed optimisations in
> __blkdev_direct_IO(), because it always serves multiple bios.
> E.g. there is no need for conditional referencing with DIO_MULTI_BIO,
> and it can _probably_ be converted to chained bios.
> 
> [...]

Applied, thanks!

[01/16] block: turn macro helpers into inline functions
        (no commit info)
[02/16] block: convert leftovers to bdev_get_queue
        (no commit info)
[03/16] block: optimise req_bio_endio()
        (no commit info)
[04/16] block: don't bloat enter_queue with percpu_ref
        (no commit info)
[05/16] block: inline a part of bio_release_pages()
        (no commit info)

Best regards,
Pavel Begunkov Oct. 20, 2021, 2:54 p.m. UTC | #5
On 10/19/21 22:24, Pavel Begunkov wrote:
> Jens tried out a similar series with some additions not yet sent out:
> 8.2-8.3 MIOPS -> ~9 MIOPS, or 8-10%.
> 
> 12/16 is bulky, but it nicely drives the numbers. Moreover, with
> it we can get rid of some no-longer-needed optimisations in
> __blkdev_direct_IO(), because it always serves multiple bios.
> E.g. there is no need for conditional referencing with DIO_MULTI_BIO,
> and it can _probably_ be converted to chained bios.

Some numbers. Using null_blk is not perfect, but empirically, judging
from the numbers Jens posts, his Optane setup usually gives somewhat
comparable results in terms of % difference (for the worst case,
probably divide the percentage difference by 2).

modprobe null_blk no_sched=1 irqmode=1 completion_nsec=0 submit_queues=16 poll_queues=32
echo 0 > /sys/block/nullb0/queue/iostats
echo 2 > /sys/block/nullb0/queue/nomerges
nice -n -20 taskset -c 0 ./io_uring -d32 -s32 -c32 -p1 -B1 -F1 -b512 /dev/nullb0
# polled=1, fixedbufs=1, register_files=1, buffered=0 QD=32, sq_ring=32, cq_ring=64

# baseline (for-5.16/block)

IOPS=4304768, IOS/call=32/32, inflight=32 (32)
IOPS=4289824, IOS/call=32/32, inflight=32 (32)
IOPS=4227808, IOS/call=32/32, inflight=32 (32)
IOPS=4187008, IOS/call=32/32, inflight=32 (32)
IOPS=4196992, IOS/call=32/32, inflight=32 (32)
IOPS=4208384, IOS/call=32/32, inflight=32 (32)
IOPS=4233888, IOS/call=32/32, inflight=32 (32)
IOPS=4266432, IOS/call=32/32, inflight=32 (32)
IOPS=4232352, IOS/call=32/32, inflight=32 (32)

# + patch 14/16 (skip advance)

IOPS=4367424, IOS/call=32/32, inflight=0 (16)
IOPS=4401088, IOS/call=32/32, inflight=32 (32)
IOPS=4400544, IOS/call=32/32, inflight=0 (29)
IOPS=4400768, IOS/call=32/32, inflight=32 (32)
IOPS=4409568, IOS/call=32/32, inflight=32 (32)
IOPS=4373888, IOS/call=32/32, inflight=32 (32)
IOPS=4392544, IOS/call=32/32, inflight=32 (32)
IOPS=4368192, IOS/call=32/32, inflight=32 (32)
IOPS=4362976, IOS/call=32/32, inflight=32 (32)

Comparing profiles. Before:
+    1.75%  io_uring  [kernel.vmlinux]  [k] bio_iov_iter_get_pages
+    0.90%  io_uring  [kernel.vmlinux]  [k] iov_iter_advance

After:
+    0.91%  io_uring  [kernel.vmlinux]  [k] bio_iov_iter_get_pages_hint
[no iov_iter_advance]

# + patches 15,16 (switch optimisation)

IOPS=4485984, IOS/call=32/32, inflight=32 (32)
IOPS=4500384, IOS/call=32/32, inflight=32 (32)
IOPS=4524512, IOS/call=32/32, inflight=32 (32)
IOPS=4507424, IOS/call=32/32, inflight=32 (32)
IOPS=4497216, IOS/call=32/32, inflight=32 (32)
IOPS=4496832, IOS/call=32/32, inflight=32 (32)
IOPS=4505632, IOS/call=32/32, inflight=32 (32)
IOPS=4476224, IOS/call=32/32, inflight=32 (32)
IOPS=4478592, IOS/call=32/32, inflight=32 (32)
IOPS=4480128, IOS/call=32/32, inflight=32 (32)
IOPS=4468640, IOS/call=32/32, inflight=32 (32)

Before:
+    1.92%  io_uring  [kernel.vmlinux]  [k] submit_bio_checks
+    5.56%  io_uring  [kernel.vmlinux]  [k] blk_mq_submit_bio
After:
+    1.66%  io_uring  [kernel.vmlinux]  [k] submit_bio_checks
+    5.49%  io_uring  [kernel.vmlinux]  [k] blk_mq_submit_bio

perf shows a 0.3% difference, while the absolute numbers show ~2%,
which is most probably just a coincidence. But 0.3% looks realistic.