Message ID: 20190215111324.30129-1-ming.lei@redhat.com (mailing list archive)
Series: block: support multi-page bvec
I still don't understand why mp_bvec_last_segment isn't simply called bvec_last_segment as there is no conflict. But I don't want to hold this series up on that as there only are two users left and we can always just fix it up later.
On 2/15/19 4:13 AM, Ming Lei wrote:
> Hi,
>
> This patchset brings multi-page bvec into the block layer:
>
> 1) what is a multi-page bvec?
>
> A multi-page bvec means that one 'struct bio_vec' can hold multiple pages
> which are physically contiguous, instead of the single page the Linux
> kernel has used for a long time.
>
> 2) why is the multi-page bvec introduced?
>
> Kent proposed the idea[1] first.
>
> As system RAM becomes much bigger than before, and huge pages, transparent
> huge pages and memory compaction are widely used, it is now fairly easy
> to see physically contiguous pages coming from filesystems in I/O. On the
> other hand, from the block layer's view, it isn't necessary to store the
> intermediate pages in the bvec; it is enough to just store the physically
> contiguous 'segment' in each io vector.
>
> Also, huge pages are being brought to filesystems and swap [2][6], so we
> can do I/O on a huge page each time[3], which requires that one bio can
> transfer at least one huge page at a time. It turns out it isn't flexible
> to simply change BIO_MAX_PAGES[3][5]. Multi-page bvecs fit this case very
> well. As we saw, if CONFIG_THP_SWAP is enabled, BIO_MAX_PAGES can be
> configured much bigger, such as 512, which requires at least two 4K pages
> just for holding the bvec table.
>
> With multi-page bvec:
>
> - Inside the block layer, both bio splitting and sg mapping can become
>   more efficient than before by just traversing the physically contiguous
>   'segment' instead of each page.
>
> - segment handling in the block layer can be improved much in future,
>   since it should be quite easy to convert a multi-page bvec into
>   segments. For example, we might just store segments in each bvec
>   directly in future.
>
> - bio size can be increased, which should improve some high-bandwidth I/O
>   cases in theory[4].
>
> - there is an opportunity in future to improve the memory footprint of
>   bvecs.
>
> 3) how is the multi-page bvec implemented in this patchset?
>
> Patches 1 ~ 3 prepare for supporting multi-page bvec.
>
> Patches 4 ~ 14 implement multi-page bvec in the block layer:
>
> - put all the tricks into the bvec/bio/rq iterators; as long as
>   drivers and filesystems use these standard iterators, they are happy
>   with multi-page bvec
>
> - introduce bio_for_each_bvec() to iterate over multi-page bvecs for
>   splitting bios and mapping sg
>
> - keep the current bio_for_each_segment*() to iterate over single-page
>   bvecs and make sure current users won't be broken; especially, convert
>   to the new helper prototype in a single patch (patch 21) given it is
>   basically a mechanical conversion
>
> - deal with iomap & xfs's sub-pagesize io vecs in patch 13
>
> - enable multi-page bvec in patch 14
>
> Patch 15 redefines BIO_MAX_PAGES as 256.
>
> Patch 16 documents usage of the bio iterator helpers.
>
> Patches 17~18 kill NO_SG_MERGE.
>
> These patches can be found in the following git tree:
>
> 	git: https://github.com/ming1/linux.git v5.0-blk_mp_bvec_v14

	                                        ^^^ v15?

> Lots of tests (blktests, xfstests, ltp io, ...) have been run with this
> patchset, and no regression was seen.
>
> Thanks Christoph for reviewing the early versions and providing very good
> suggestions, such as: introduce bio_init_with_vec_table(), remove other
> unnecessary helpers for cleanup, and so on.
>
> Thanks Christoph and Omar for reviewing V10/V11/V12 and providing lots of
> helpful comments.

Applied, thanks Ming. Let's hope it sticks!
On Fri, 2019-02-15 at 08:49 -0700, Jens Axboe wrote:
> On 2/15/19 4:13 AM, Ming Lei wrote:
> > This patchset brings multi-page bvec into block layer:
>
> Applied, thanks Ming. Let's hope it sticks!

Hi Jens and Ming,

Test nvmeof-mp/002 fails with Jens' for-next branch from this morning.
I have not yet tried to figure out which patch introduced the failure.
Anyway, this is what I see in the kernel log for test nvmeof-mp/002:

[  475.611363] BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
[  475.621188] #PF error: [normal kernel read fault]
[  475.623148] PGD 0 P4D 0
[  475.624737] Oops: 0000 [#1] PREEMPT SMP KASAN
[  475.626628] CPU: 1 PID: 277 Comm: kworker/1:1H Tainted: G    B    5.0.0-rc6-dbg+ #1
[  475.630232] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
[  475.633855] Workqueue: kblockd blk_mq_requeue_work
[  475.635777] RIP: 0010:__blk_recalc_rq_segments+0xbe/0x590
[  475.670948] Call Trace:
[  475.693515]  blk_recalc_rq_segments+0x2f/0x50
[  475.695081]  blk_insert_cloned_request+0xbb/0x1c0
[  475.701142]  dm_mq_queue_rq+0x3d1/0x770
[  475.707225]  blk_mq_dispatch_rq_list+0x5fc/0xb10
[  475.717137]  blk_mq_sched_dispatch_requests+0x256/0x300
[  475.721767]  __blk_mq_run_hw_queue+0xd6/0x180
[  475.725920]  __blk_mq_delay_run_hw_queue+0x25c/0x290
[  475.727480]  blk_mq_run_hw_queue+0x119/0x1b0
[  475.732019]  blk_mq_run_hw_queues+0x7b/0xa0
[  475.733468]  blk_mq_requeue_work+0x2cb/0x300
[  475.736473]  process_one_work+0x4f1/0xa40
[  475.739424]  worker_thread+0x67/0x5b0
[  475.741751]  kthread+0x1cf/0x1f0
[  475.746034]  ret_from_fork+0x24/0x30

(gdb) list *(__blk_recalc_rq_segments+0xbe)
0xffffffff816a152e is in __blk_recalc_rq_segments (block/blk-merge.c:366).
361						     struct bio *bio)
362	{
363		struct bio_vec bv, bvprv = { NULL };
364		int prev = 0;
365		unsigned int seg_size, nr_phys_segs;
366		unsigned front_seg_size = bio->bi_seg_front_size;
367		struct bio *fbio, *bbio;
368		struct bvec_iter iter;
369	
370		if (!bio)

Bart.
On 2/15/19 10:14 AM, Bart Van Assche wrote:
> On Fri, 2019-02-15 at 08:49 -0700, Jens Axboe wrote:
>> On 2/15/19 4:13 AM, Ming Lei wrote:
>>> This patchset brings multi-page bvec into block layer:
>>
>> Applied, thanks Ming. Let's hope it sticks!
>
> Hi Jens and Ming,
>
> Test nvmeof-mp/002 fails with Jens' for-next branch from this morning.
> I have not yet tried to figure out which patch introduced the failure.
> Anyway, this is what I see in the kernel log for test nvmeof-mp/002:
>
> [  475.611363] BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
> [ ... ]
> [  475.635777] RIP: 0010:__blk_recalc_rq_segments+0xbe/0x590
> [ ... ]
>
> (gdb) list *(__blk_recalc_rq_segments+0xbe)
> 0xffffffff816a152e is in __blk_recalc_rq_segments (block/blk-merge.c:366).
> [ ... ]

Just ran a few tests, and it also seems to cause about a 5% regression
in per-core IOPS throughput. Prior to this work, I could get 1620K 4k
rand read IOPS out of a core; now I'm at ~1535K. The cycle stealer seems
to be blk_queue_split() and blk_rq_map_sg().
On Fri, Feb 15, 2019 at 03:51:26PM +0100, Christoph Hellwig wrote:
> I still don't understand why mp_bvec_last_segment isn't simply
> called bvec_last_segment as there is no conflict. But I don't
> want to hold this series up on that as there only are two users
> left and we can always just fix it up later.

mp_bvec_last_segment() is one of the bvec helpers, so it is better to
keep its name consistent with the other bvec helpers.

Thanks,
Ming
On Fri, Feb 15, 2019 at 09:14:15AM -0800, Bart Van Assche wrote:
> On Fri, 2019-02-15 at 08:49 -0700, Jens Axboe wrote:
> > On 2/15/19 4:13 AM, Ming Lei wrote:
> > > This patchset brings multi-page bvec into block layer:
> >
> > Applied, thanks Ming. Let's hope it sticks!
>
> Hi Jens and Ming,
>
> Test nvmeof-mp/002 fails with Jens' for-next branch from this morning.
> I have not yet tried to figure out which patch introduced the failure.
> Anyway, this is what I see in the kernel log for test nvmeof-mp/002:
>
> [  475.611363] BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
> [ ... ]
> [  475.635777] RIP: 0010:__blk_recalc_rq_segments+0xbe/0x590
> [ ... ]
>
> (gdb) list *(__blk_recalc_rq_segments+0xbe)
> 0xffffffff816a152e is in __blk_recalc_rq_segments (block/blk-merge.c:366).
> 361						     struct bio *bio)
> 362	{
> 363		struct bio_vec bv, bvprv = { NULL };
> 364		int prev = 0;
> 365		unsigned int seg_size, nr_phys_segs;
> 366		unsigned front_seg_size = bio->bi_seg_front_size;
> 367		struct bio *fbio, *bbio;
> 368		struct bvec_iter iter;
> 369	
> 370		if (!bio)
>
> Bart.

Thanks for your test!

The following patch should fix this issue:

diff --git a/block/blk-merge.c b/block/blk-merge.c
index bed065904677..066b66430523 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -363,13 +363,15 @@ static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
 	struct bio_vec bv, bvprv = { NULL };
 	int prev = 0;
 	unsigned int seg_size, nr_phys_segs;
-	unsigned front_seg_size = bio->bi_seg_front_size;
+	unsigned front_seg_size;
 	struct bio *fbio, *bbio;
 	struct bvec_iter iter;
 
 	if (!bio)
 		return 0;
 
+	front_seg_size = bio->bi_seg_front_size;
+
 	switch (bio_op(bio)) {
 	case REQ_OP_DISCARD:
 	case REQ_OP_SECURE_ERASE:

Thanks,
Ming
On Fri, Feb 15, 2019 at 10:59:47AM -0700, Jens Axboe wrote:
> On 2/15/19 10:14 AM, Bart Van Assche wrote:
> > On Fri, 2019-02-15 at 08:49 -0700, Jens Axboe wrote:
> >> On 2/15/19 4:13 AM, Ming Lei wrote:
> >>> This patchset brings multi-page bvec into block layer:
> >>
> >> Applied, thanks Ming. Let's hope it sticks!
> >
> > Hi Jens and Ming,
> >
> > Test nvmeof-mp/002 fails with Jens' for-next branch from this morning.
> > [ ... ]
>
> Just ran a few tests, and it also seems to cause about a 5% regression
> in per-core IOPS throughput. Prior to this work, I could get 1620K 4k
> rand read IOPS out of a core; now I'm at ~1535K. The cycle stealer seems
> to be blk_queue_split() and blk_rq_map_sg().

Could you share your test setup with us?

I will run null_blk first and see if it can be reproduced.

Thanks,
Ming
On Sun, Feb 17, 2019 at 09:13:32PM +0800, Ming Lei wrote:
> On Fri, Feb 15, 2019 at 10:59:47AM -0700, Jens Axboe wrote:
> > On 2/15/19 10:14 AM, Bart Van Assche wrote:
> > > Test nvmeof-mp/002 fails with Jens' for-next branch from this morning.
> > > [ ... ]
> >
> > Just ran a few tests, and it also seems to cause about a 5% regression
> > in per-core IOPS throughput. Prior to this work, I could get 1620K 4k
> > rand read IOPS out of a core; now I'm at ~1535K. The cycle stealer seems
> > to be blk_queue_split() and blk_rq_map_sg().
>
> Could you share your test setup with us?
>
> I will run null_blk first and see if it can be reproduced.

This performance drop isn't reproduced on null_blk here with the
following setup:

- modprobe null_blk nr_devices=4 submit_queues=48
- test machine: dual socket, two NUMA nodes, 24 cores/socket
- fio script:

  fio --direct=1 --size=128G --bsrange=4k-4k --runtime=40 --numjobs=48 \
      --ioengine=libaio --iodepth=64 --group_reporting=1 \
      --filename=/dev/nullb0 --name=randread --rw=randread

Result: 10.7M IOPS (base kernel), 10.6M IOPS (patched kernel).

And if 'bs' is increased to 256k, 512k or 1024k, the IOPS improvement
can be ~8% with the multi-page bvec patches in the above test.

BTW, there is no cost added to bio_for_each_bvec(), so blk_queue_split()
and blk_rq_map_sg() should be fine. However, bio_for_each_segment_all()
may not be as quick as before.

Thanks,
Ming
On Sun, 2019-02-17 at 21:11 +0800, Ming Lei wrote:
> The following patch should fix this issue:
>
> diff --git a/block/blk-merge.c b/block/blk-merge.c
> index bed065904677..066b66430523 100644
> --- a/block/blk-merge.c
> +++ b/block/blk-merge.c
> @@ -363,13 +363,15 @@ static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
>  	struct bio_vec bv, bvprv = { NULL };
>  	int prev = 0;
>  	unsigned int seg_size, nr_phys_segs;
> -	unsigned front_seg_size = bio->bi_seg_front_size;
> +	unsigned front_seg_size;
>  	struct bio *fbio, *bbio;
>  	struct bvec_iter iter;
>  
>  	if (!bio)
>  		return 0;
>  
> +	front_seg_size = bio->bi_seg_front_size;
> +
>  	switch (bio_op(bio)) {
>  	case REQ_OP_DISCARD:
>  	case REQ_OP_SECURE_ERASE:

Hi Ming,

With this patch applied test nvmeof-mp/002 fails as follows:

[  694.700400] kernel BUG at lib/sg_pool.c:103!
[  694.705932] invalid opcode: 0000 [#1] PREEMPT SMP KASAN
[  694.708297] CPU: 2 PID: 349 Comm: kworker/2:1H Tainted: G    B    5.0.0-rc6-dbg+ #2
[  694.711730] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
[  694.715113] Workqueue: kblockd blk_mq_run_work_fn
[  694.716894] RIP: 0010:sg_alloc_table_chained+0xe5/0xf0
[  694.758222] Call Trace:
[  694.759645]  nvme_rdma_queue_rq+0x2aa/0xcc0 [nvme_rdma]
[  694.764915]  blk_mq_try_issue_directly+0x2a5/0x4b0
[  694.771779]  blk_insert_cloned_request+0x11e/0x1c0
[  694.778417]  dm_mq_queue_rq+0x3d1/0x770
[  694.793400]  blk_mq_dispatch_rq_list+0x5fc/0xb10
[  694.798386]  blk_mq_sched_dispatch_requests+0x2f7/0x300
[  694.803180]  __blk_mq_run_hw_queue+0xd6/0x180
[  694.808933]  blk_mq_run_work_fn+0x27/0x30
[  694.810315]  process_one_work+0x4f1/0xa40
[  694.813178]  worker_thread+0x67/0x5b0
[  694.814487]  kthread+0x1cf/0x1f0
[  694.819134]  ret_from_fork+0x24/0x30

The code in sg_pool.c that triggers the BUG() statement is as follows:

int sg_alloc_table_chained(struct sg_table *table, int nents,
		struct scatterlist *first_chunk)
{
	int ret;

	BUG_ON(!nents);
	[ ... ]

Bart.
On Tue, Feb 19, 2019 at 08:28:19AM -0800, Bart Van Assche wrote:
> On Sun, 2019-02-17 at 21:11 +0800, Ming Lei wrote:
> > The following patch should fix this issue:
> > [ ... ]
>
> Hi Ming,
>
> With this patch applied test nvmeof-mp/002 fails as follows:
>
> [  694.700400] kernel BUG at lib/sg_pool.c:103!
> [ ... ]
> [  694.716894] RIP: 0010:sg_alloc_table_chained+0xe5/0xf0
> [ ... ]
>
> The code in sg_pool.c that triggers the BUG() statement is as follows:
>
> int sg_alloc_table_chained(struct sg_table *table, int nents,
> 		struct scatterlist *first_chunk)
> {
> 	int ret;
>
> 	BUG_ON(!nents);
> 	[ ... ]
>
> Bart.

I can reproduce this issue ("kernel BUG at lib/sg_pool.c:103") without the
mp-bvec patches, so it looks like it isn't the fault of this patchset.

Thanks,
Ming
On 2/19/19 5:17 PM, Ming Lei wrote:
> On Tue, Feb 19, 2019 at 08:28:19AM -0800, Bart Van Assche wrote:
>> With this patch applied test nvmeof-mp/002 fails as follows:
>>
>> [  694.700400] kernel BUG at lib/sg_pool.c:103!
>> [ ... ]
>> [  694.716894] RIP: 0010:sg_alloc_table_chained+0xe5/0xf0
>> [ ... ]
>>
>> The code in sg_pool.c that triggers the BUG() statement is as follows:
>>
>> int sg_alloc_table_chained(struct sg_table *table, int nents,
>> 		struct scatterlist *first_chunk)
>> {
>> 	int ret;
>>
>> 	BUG_ON(!nents);
>> 	[ ... ]
>>
>> Bart.
>
> I can reproduce this issue ("kernel BUG at lib/sg_pool.c:103") without
> the mp-bvec patches, so it looks like it isn't the fault of this patchset.

Thanks Ming for your feedback.

Jens, I don't see that issue with kernel v5.0-rc6. Does that mean that the
sg_pool BUG() is a regression in your for-next branch that predates Ming's
multi-page bvec patch series?

Thanks,

Bart.