Message ID: 20190215111324.30129-1-ming.lei@redhat.com (mailing list archive)
Series: block: support multi-page bvec
I still don't understand why mp_bvec_last_segment isn't simply called bvec_last_segment as there is no conflict. But I don't want to hold this series up on that as there only are two users left and we can always just fix it up later.
On 2/15/19 4:13 AM, Ming Lei wrote:
> Hi,
>
> This patchset brings multi-page bvec into the block layer:
>
> 1) what is a multi-page bvec?
>
> A multi-page bvec means that one 'struct bio_vec' can hold multiple pages
> which are physically contiguous, instead of the single page the Linux
> kernel has used for a long time.
>
> 2) why is the multi-page bvec introduced?
>
> Kent proposed the idea[1] first.
>
> As system RAM becomes much bigger than before, and huge pages, transparent
> huge pages and memory compaction are widely used, it is now fairly easy
> to see physically contiguous pages coming from filesystems in I/O. On the
> other hand, from the block layer's view, it isn't necessary to store the
> intermediate pages in the bvec; it is enough to just store the physically
> contiguous 'segment' in each io vector.
>
> Also, huge pages are being brought to filesystems and swap [2][6], so we
> can do I/O on a huge page each time[3], which requires that one bio can
> transfer at least one huge page at a time. It turns out it isn't flexible
> to simply change BIO_MAX_PAGES[3][5]. Multi-page bvecs fit this case very
> well. As we saw, if CONFIG_THP_SWAP is enabled, BIO_MAX_PAGES can be
> configured much bigger, such as 512, which requires at least two 4K pages
> just for holding the bvec table.
>
> With multi-page bvec:
>
> - Inside the block layer, both bio splitting and sg mapping can become
>   more efficient than before by just traversing the physically contiguous
>   'segment' instead of each page.
>
> - segment handling in the block layer can be improved much in future,
>   since it should be quite easy to convert a multi-page bvec into
>   segments. For example, we might just store segments in each bvec
>   directly in future.
>
> - bio size can be increased, which should improve some high-bandwidth I/O
>   cases in theory[4].
>
> - there is an opportunity in future to improve the memory footprint of
>   bvecs.
>
> 3) how is the multi-page bvec implemented in this patchset?
>
> Patches 1 ~ 3 prepare for supporting multi-page bvec.
>
> Patches 4 ~ 14 implement multi-page bvec in the block layer:
>
> - put all the tricks into the bvec/bio/rq iterators; as long as
>   drivers and filesystems use these standard iterators, they are happy
>   with multi-page bvec
>
> - introduce bio_for_each_bvec() to iterate over multi-page bvecs for
>   splitting bios and mapping sg
>
> - keep the current bio_for_each_segment*() to iterate over single-page
>   bvecs and make sure current users won't be broken; especially, convert
>   to the new helper prototype in a single patch (patch 21) given it is
>   basically a mechanical conversion
>
> - deal with iomap & xfs's sub-pagesize io vecs in patch 13
>
> - enable multi-page bvec in patch 14
>
> Patch 15 redefines BIO_MAX_PAGES as 256.
>
> Patch 16 documents usage of the bio iterator helpers.
>
> Patches 17~18 kill NO_SG_MERGE.
>
> These patches can be found in the following git tree:
>
> 	git: https://github.com/ming1/linux.git v5.0-blk_mp_bvec_v14

	                                        ^^^ v15?

> Lots of tests (blktests, xfstests, ltp io, ...) have been run with this
> patchset, and no regression was seen.
>
> Thanks Christoph for reviewing the early versions and providing very good
> suggestions, such as: introduce bio_init_with_vec_table(), remove other
> unnecessary helpers for cleanup, and so on.
>
> Thanks Christoph and Omar for reviewing V10/V11/V12 and providing lots of
> helpful comments.

Applied, thanks Ming. Let's hope it sticks!
On Fri, 2019-02-15 at 08:49 -0700, Jens Axboe wrote:
> On 2/15/19 4:13 AM, Ming Lei wrote:
> > This patchset brings multi-page bvec into block layer:
>
> Applied, thanks Ming. Let's hope it sticks!

Hi Jens and Ming,

Test nvmeof-mp/002 fails with Jens' for-next branch from this morning.
I have not yet tried to figure out which patch introduced the failure.
Anyway, this is what I see in the kernel log for test nvmeof-mp/002:

[  475.611363] BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
[  475.621188] #PF error: [normal kernel read fault]
[  475.623148] PGD 0 P4D 0
[  475.624737] Oops: 0000 [#1] PREEMPT SMP KASAN
[  475.626628] CPU: 1 PID: 277 Comm: kworker/1:1H Tainted: G    B    5.0.0-rc6-dbg+ #1
[  475.630232] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
[  475.633855] Workqueue: kblockd blk_mq_requeue_work
[  475.635777] RIP: 0010:__blk_recalc_rq_segments+0xbe/0x590
[  475.670948] Call Trace:
[  475.693515]  blk_recalc_rq_segments+0x2f/0x50
[  475.695081]  blk_insert_cloned_request+0xbb/0x1c0
[  475.701142]  dm_mq_queue_rq+0x3d1/0x770
[  475.707225]  blk_mq_dispatch_rq_list+0x5fc/0xb10
[  475.717137]  blk_mq_sched_dispatch_requests+0x256/0x300
[  475.721767]  __blk_mq_run_hw_queue+0xd6/0x180
[  475.725920]  __blk_mq_delay_run_hw_queue+0x25c/0x290
[  475.727480]  blk_mq_run_hw_queue+0x119/0x1b0
[  475.732019]  blk_mq_run_hw_queues+0x7b/0xa0
[  475.733468]  blk_mq_requeue_work+0x2cb/0x300
[  475.736473]  process_one_work+0x4f1/0xa40
[  475.739424]  worker_thread+0x67/0x5b0
[  475.741751]  kthread+0x1cf/0x1f0
[  475.746034]  ret_from_fork+0x24/0x30

(gdb) list *(__blk_recalc_rq_segments+0xbe)
0xffffffff816a152e is in __blk_recalc_rq_segments (block/blk-merge.c:366).
361						     struct bio *bio)
362	{
363		struct bio_vec bv, bvprv = { NULL };
364		int prev = 0;
365		unsigned int seg_size, nr_phys_segs;
366		unsigned front_seg_size = bio->bi_seg_front_size;
367		struct bio *fbio, *bbio;
368		struct bvec_iter iter;
369	
370		if (!bio)

Bart.
On 2/15/19 10:14 AM, Bart Van Assche wrote:
> On Fri, 2019-02-15 at 08:49 -0700, Jens Axboe wrote:
>> On 2/15/19 4:13 AM, Ming Lei wrote:
>>> This patchset brings multi-page bvec into block layer:
>>
>> Applied, thanks Ming. Let's hope it sticks!
>
> Hi Jens and Ming,
>
> Test nvmeof-mp/002 fails with Jens' for-next branch from this morning.
> I have not yet tried to figure out which patch introduced the failure.
> Anyway, this is what I see in the kernel log for test nvmeof-mp/002:
>
> [  475.611363] BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
> [ ... ]
> [  475.635777] RIP: 0010:__blk_recalc_rq_segments+0xbe/0x590
> [ ... ]
>
> (gdb) list *(__blk_recalc_rq_segments+0xbe)
> 0xffffffff816a152e is in __blk_recalc_rq_segments (block/blk-merge.c:366).
> [ ... ]

Just ran a few tests, and it also seems to cause about a 5% regression
in per-core IOPS throughput. Prior to this work, I could get 1620K 4k
rand read IOPS out of a core; now I'm at ~1535K. The cycle stealer seems
to be blk_queue_split() and blk_rq_map_sg().
On Fri, Feb 15, 2019 at 03:51:26PM +0100, Christoph Hellwig wrote:
> I still don't understand why mp_bvec_last_segment isn't simply
> called bvec_last_segment as there is no conflict. But I don't
> want to hold this series up on that as there only are two users
> left and we can always just fix it up later.

mp_bvec_last_segment() is one of the bvec helpers, so it is better to
keep its name consistent with the other bvec helpers.

Thanks,
Ming
On Fri, Feb 15, 2019 at 09:14:15AM -0800, Bart Van Assche wrote:
> On Fri, 2019-02-15 at 08:49 -0700, Jens Axboe wrote:
> > On 2/15/19 4:13 AM, Ming Lei wrote:
> > > This patchset brings multi-page bvec into block layer:
> >
> > Applied, thanks Ming. Let's hope it sticks!
>
> Hi Jens and Ming,
>
> Test nvmeof-mp/002 fails with Jens' for-next branch from this morning.
> I have not yet tried to figure out which patch introduced the failure.
> Anyway, this is what I see in the kernel log for test nvmeof-mp/002:
>
> [  475.611363] BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
> [ ... ]
> [  475.635777] RIP: 0010:__blk_recalc_rq_segments+0xbe/0x590
> [ ... ]
>
> (gdb) list *(__blk_recalc_rq_segments+0xbe)
> 0xffffffff816a152e is in __blk_recalc_rq_segments (block/blk-merge.c:366).
> 361						     struct bio *bio)
> 362	{
> 363		struct bio_vec bv, bvprv = { NULL };
> 364		int prev = 0;
> 365		unsigned int seg_size, nr_phys_segs;
> 366		unsigned front_seg_size = bio->bi_seg_front_size;
> 367		struct bio *fbio, *bbio;
> 368		struct bvec_iter iter;
> 369	
> 370		if (!bio)
>
> Bart.

Thanks for your test!

The following patch should fix this issue:

diff --git a/block/blk-merge.c b/block/blk-merge.c
index bed065904677..066b66430523 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -363,13 +363,15 @@ static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
 	struct bio_vec bv, bvprv = { NULL };
 	int prev = 0;
 	unsigned int seg_size, nr_phys_segs;
-	unsigned front_seg_size = bio->bi_seg_front_size;
+	unsigned front_seg_size;
 	struct bio *fbio, *bbio;
 	struct bvec_iter iter;
 
 	if (!bio)
 		return 0;
 
+	front_seg_size = bio->bi_seg_front_size;
+
 	switch (bio_op(bio)) {
 	case REQ_OP_DISCARD:
 	case REQ_OP_SECURE_ERASE:

Thanks,
Ming
On Fri, Feb 15, 2019 at 10:59:47AM -0700, Jens Axboe wrote:
> On 2/15/19 10:14 AM, Bart Van Assche wrote:
> > On Fri, 2019-02-15 at 08:49 -0700, Jens Axboe wrote:
> >> On 2/15/19 4:13 AM, Ming Lei wrote:
> >>> This patchset brings multi-page bvec into block layer:
> >>
> >> Applied, thanks Ming. Let's hope it sticks!
> >
> > Hi Jens and Ming,
> >
> > Test nvmeof-mp/002 fails with Jens' for-next branch from this morning.
> > [ ... ]
>
> Just ran a few tests, and it also seems to cause about a 5% regression
> in per-core IOPS throughput. Prior to this work, I could get 1620K 4k
> rand read IOPS out of a core; now I'm at ~1535K. The cycle stealer seems
> to be blk_queue_split() and blk_rq_map_sg().

Could you share your test setup with us?

I will run null_blk first and see if it can be reproduced.

Thanks,
Ming
On Sun, Feb 17, 2019 at 09:13:32PM +0800, Ming Lei wrote:
> On Fri, Feb 15, 2019 at 10:59:47AM -0700, Jens Axboe wrote:
> > On 2/15/19 10:14 AM, Bart Van Assche wrote:
> > > Test nvmeof-mp/002 fails with Jens' for-next branch from this morning.
> > > [ ... ]
> >
> > Just ran a few tests, and it also seems to cause about a 5% regression
> > in per-core IOPS throughput. Prior to this work, I could get 1620K 4k
> > rand read IOPS out of a core; now I'm at ~1535K. The cycle stealer seems
> > to be blk_queue_split() and blk_rq_map_sg().
>
> Could you share your test setup with us?
>
> I will run null_blk first and see if it can be reproduced.

This performance drop isn't reproduced on null_blk here with the
following setup:

- modprobe null_blk nr_devices=4 submit_queues=48
- test machine: dual socket, two NUMA nodes, 24 cores/socket
- fio script:

  fio --direct=1 --size=128G --bsrange=4k-4k --runtime=40 --numjobs=48 \
      --ioengine=libaio --iodepth=64 --group_reporting=1 \
      --filename=/dev/nullb0 --name=randread --rw=randread

Result: 10.7M IOPS (base kernel), 10.6M IOPS (patched kernel).

And if 'bs' is increased to 256k, 512k or 1024k, the IOPS improvement
can be ~8% with the multi-page bvec patches in the above test.

BTW, there is no cost added to bio_for_each_bvec(), so blk_queue_split()
and blk_rq_map_sg() should be fine. However, bio_for_each_segment_all()
may not be as quick as before.

Thanks,
Ming
On Sun, 2019-02-17 at 21:11 +0800, Ming Lei wrote:
> The following patch should fix this issue:
>
> diff --git a/block/blk-merge.c b/block/blk-merge.c
> index bed065904677..066b66430523 100644
> --- a/block/blk-merge.c
> +++ b/block/blk-merge.c
> @@ -363,13 +363,15 @@ static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
>  	struct bio_vec bv, bvprv = { NULL };
>  	int prev = 0;
>  	unsigned int seg_size, nr_phys_segs;
> -	unsigned front_seg_size = bio->bi_seg_front_size;
> +	unsigned front_seg_size;
>  	struct bio *fbio, *bbio;
>  	struct bvec_iter iter;
>  
>  	if (!bio)
>  		return 0;
>  
> +	front_seg_size = bio->bi_seg_front_size;
> +
>  	switch (bio_op(bio)) {
>  	case REQ_OP_DISCARD:
>  	case REQ_OP_SECURE_ERASE:

Hi Ming,

With this patch applied test nvmeof-mp/002 fails as follows:

[  694.700400] kernel BUG at lib/sg_pool.c:103!
[  694.705932] invalid opcode: 0000 [#1] PREEMPT SMP KASAN
[  694.708297] CPU: 2 PID: 349 Comm: kworker/2:1H Tainted: G    B    5.0.0-rc6-dbg+ #2
[  694.711730] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
[  694.715113] Workqueue: kblockd blk_mq_run_work_fn
[  694.716894] RIP: 0010:sg_alloc_table_chained+0xe5/0xf0
[  694.758222] Call Trace:
[  694.759645]  nvme_rdma_queue_rq+0x2aa/0xcc0 [nvme_rdma]
[  694.764915]  blk_mq_try_issue_directly+0x2a5/0x4b0
[  694.771779]  blk_insert_cloned_request+0x11e/0x1c0
[  694.778417]  dm_mq_queue_rq+0x3d1/0x770
[  694.793400]  blk_mq_dispatch_rq_list+0x5fc/0xb10
[  694.798386]  blk_mq_sched_dispatch_requests+0x2f7/0x300
[  694.803180]  __blk_mq_run_hw_queue+0xd6/0x180
[  694.808933]  blk_mq_run_work_fn+0x27/0x30
[  694.810315]  process_one_work+0x4f1/0xa40
[  694.813178]  worker_thread+0x67/0x5b0
[  694.814487]  kthread+0x1cf/0x1f0
[  694.819134]  ret_from_fork+0x24/0x30

The code in sg_pool.c that triggers the BUG() statement is as follows:

int sg_alloc_table_chained(struct sg_table *table, int nents,
		struct scatterlist *first_chunk)
{
	int ret;

	BUG_ON(!nents);
	[ ... ]

Bart.
On Tue, Feb 19, 2019 at 08:28:19AM -0800, Bart Van Assche wrote:
> On Sun, 2019-02-17 at 21:11 +0800, Ming Lei wrote:
> > The following patch should fix this issue:
> > [ ... ]
>
> Hi Ming,
>
> With this patch applied test nvmeof-mp/002 fails as follows:
>
> [  694.700400] kernel BUG at lib/sg_pool.c:103!
> [ ... ]
> [  694.716894] RIP: 0010:sg_alloc_table_chained+0xe5/0xf0
> [ ... ]
>
> The code in sg_pool.c that triggers the BUG() statement is as follows:
>
> int sg_alloc_table_chained(struct sg_table *table, int nents,
> 		struct scatterlist *first_chunk)
> {
> 	int ret;
>
> 	BUG_ON(!nents);
> 	[ ... ]
>
> Bart.

I can reproduce this issue ("kernel BUG at lib/sg_pool.c:103") without the
mp-bvec patches, so it looks like it isn't the fault of this patchset.

Thanks,
Ming
On 2/19/19 5:17 PM, Ming Lei wrote:
> On Tue, Feb 19, 2019 at 08:28:19AM -0800, Bart Van Assche wrote:
>> With this patch applied test nvmeof-mp/002 fails as follows:
>>
>> [  694.700400] kernel BUG at lib/sg_pool.c:103!
>> [ ... ]
>> [  694.716894] RIP: 0010:sg_alloc_table_chained+0xe5/0xf0
>> [ ... ]
>>
>> The code in sg_pool.c that triggers the BUG() statement is as follows:
>>
>> int sg_alloc_table_chained(struct sg_table *table, int nents,
>> 		struct scatterlist *first_chunk)
>> {
>> 	int ret;
>>
>> 	BUG_ON(!nents);
>> 	[ ... ]
>>
>> Bart.
>
> I can reproduce this issue ("kernel BUG at lib/sg_pool.c:103") without
> the mp-bvec patches, so it looks like it isn't the fault of this patchset.

Thanks Ming for your feedback.

Jens, I don't see that issue with kernel v5.0-rc6. Does that mean that the
sg_pool BUG() is a regression in your for-next branch that predates Ming's
multi-page bvec patch series?

Thanks,

Bart.