Message ID | 20210826144039.2143-1-thunder.leizhen@huawei.com (mailing list archive)
---|---
State | New, archived
Series | block/mq-deadline: Speed up the dispatch of low-priority requests
On 8/26/21 7:40 AM, Zhen Lei wrote:
> lock protection needs to be added only in dd_finish_request(), which
> is unlikely to cause significant performance side effects.

Not sure the above is correct. Every new atomic instruction has a
measurable performance overhead. But I guess in this case that overhead
is smaller than the time needed to sum 128 per-CPU variables.

> Tested on my 128-core board with two ssd disks.
> fio bs=4k rw=read iodepth=128 cpus_allowed=0-95 <others>
> Before:
>   [183K/0/0 iops]
>   [172K/0/0 iops]
>
> After:
>   [258K/0/0 iops]
>   [258K/0/0 iops]

Nice work!

> Fixes: fb926032b320 ("block/mq-deadline: Prioritize high-priority requests")

Shouldn't the Fixes: tag be used only for patches that modify
functionality? I'm not sure it is appropriate to use this tag for
performance improvements.

>  struct deadline_data {
> @@ -277,9 +278,9 @@ deadline_move_request(struct deadline_data *dd, struct dd_per_prio *per_prio,
>  }
>
>  /* Number of requests queued for a given priority level. */
> -static u32 dd_queued(struct deadline_data *dd, enum dd_prio prio)
> +static __always_inline u32 dd_queued(struct deadline_data *dd, enum dd_prio prio)
>  {
> -	return dd_sum(dd, inserted, prio) - dd_sum(dd, completed, prio);
> +	return dd->per_prio[prio].nr_queued;
>  }

Please leave out "__always_inline". Modern compilers are smart enough to
inline this function without using the "inline" keyword.

> @@ -711,6 +712,8 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
>
>  	prio = ioprio_class_to_prio[ioprio_class];
>  	dd_count(dd, inserted, prio);
> +	per_prio = &dd->per_prio[prio];
> +	per_prio->nr_queued++;
>
>  	if (blk_mq_sched_try_insert_merge(q, rq, &free)) {
>  		blk_mq_free_requests(&free);

I think the above is wrong - nr_queued should not be incremented if the
request is merged into another request. Please move the code that
increments nr_queued past the above if-statement.

Thanks,

Bart.
On 8/26/21 12:09 PM, Bart Van Assche wrote:
> On 8/26/21 7:40 AM, Zhen Lei wrote:
>> lock protection needs to be added only in dd_finish_request(), which
>> is unlikely to cause significant performance side effects.
>
> Not sure the above is correct. Every new atomic instruction has a
> measurable performance overhead. But I guess in this case that
> overhead is smaller than the time needed to sum 128 per-CPU variables.

Percpu counters only really work if the summing is not in a hot path,
or if the summing is just some "not zero" check instead of a full sum.
They just don't scale at all for even moderately sized systems.

>> Tested on my 128-core board with two ssd disks.
>> fio bs=4k rw=read iodepth=128 cpus_allowed=0-95 <others>
>> Before:
>>   [183K/0/0 iops]
>>   [172K/0/0 iops]
>>
>> After:
>>   [258K/0/0 iops]
>>   [258K/0/0 iops]
>
> Nice work!
>
>> Fixes: fb926032b320 ("block/mq-deadline: Prioritize high-priority requests")
>
> Shouldn't the Fixes: tag be used only for patches that modify
> functionality? I'm not sure it is appropriate to use this tag for
> performance improvements.

For a regression this big, I think it's the right thing. Anyone who may
backport the original commit definitely should also get the follow-up
fix. This isn't just a performance improvement, it's fixing a big
performance regression.
On 8/26/21 12:13 PM, Jens Axboe wrote:
> On 8/26/21 12:09 PM, Bart Van Assche wrote:
>> On 8/26/21 7:40 AM, Zhen Lei wrote:
>>> lock protection needs to be added only in dd_finish_request(), which
>>> is unlikely to cause significant performance side effects.
>>
>> Not sure the above is correct. Every new atomic instruction has a
>> measurable performance overhead. But I guess in this case that
>> overhead is smaller than the time needed to sum 128 per-CPU variables.
>
> Percpu counters only really work if the summing is not in a hot path,
> or if the summing is just some "not zero" check instead of a full sum.
> They just don't scale at all for even moderately sized systems.

Ugh, it's actually even worse in this case, since you do:

static u32 dd_queued(struct deadline_data *dd, enum dd_prio prio)
{
	return dd_sum(dd, inserted, prio) - dd_sum(dd, completed, prio);
}

which ends up iterating possible CPUs _twice_!

Just ran a quick test here, and I go from 3.55M IOPS to 1.23M switching
to deadline, of which 37% of the overhead is from dd_dispatch(). With
the posted patch applied, it runs at 2.3M IOPS with mq-deadline, which
is a lot better. This is on my 3970X test box, so 32 cores, 64 threads.

Bart, either we fix this up ASAP and get rid of the percpu counters in
the hot path, or we revert this patch.
On 8/26/21 11:45 AM, Jens Axboe wrote:
> Bart, either we fix this up ASAP and get rid of the percpu counters in
> the hot path, or we revert this patch.

I'm at home taking care of my son because he is sick. Do you want to
wait for Zhen to post a fixed version of his patch, or do you want me to
fix his patch?

Bart.
On 8/26/21 1:17 PM, Bart Van Assche wrote:
> On 8/26/21 11:45 AM, Jens Axboe wrote:
>> Bart, either we fix this up ASAP and get rid of the percpu counters in
>> the hot path, or we revert this patch.
>
> I'm at home taking care of my son because he is sick.

Take care of that, always more important than the kernel :-)

> Do you want to wait for Zhen to post a fixed version of his patch or do
> you want me to fix his patch?

To be a bit proactive, I queued up a revert. Messing with it at this
point is making me a bit nervous. If the final patch ends up looking
clean enough, then we can still go that route. But it really has to be
unequivocally done this late in the game...
On 8/26/21 11:45 AM, Jens Axboe wrote:
> Just ran a quick test here, and I go from 3.55M IOPS to 1.23M switching
> to deadline, of which 37% of the overhead is from dd_dispatch().
>
> With the posted patch applied, it runs at 2.3M IOPS with mq-deadline,
> which is a lot better. This is on my 3970X test box, so 32 cores, 64
> threads.

Hi Jens,

With the script below, queue depth >= 2 and an improved version of
Zhen's patch I see 970 K IOPS with the mq-deadline scheduler in an
8 core VM (i7-4790 CPU). In other words, more IOPS than what Zhen
reported with fewer CPU cores. Is that good enough?

Thanks,

Bart.


#!/bin/bash

if [ -e /sys/kernel/config/nullb ]; then
    for d in /sys/kernel/config/nullb/*; do
        [ -d "$d" ] && rmdir "$d"
    done
fi
numcpus=$(grep -c ^processor /proc/cpuinfo)
modprobe -r null_blk
[ -e /sys/module/null_blk ] && exit $?
modprobe null_blk nr_devices=0 &&
    udevadm settle &&
    cd /sys/kernel/config/nullb &&
    mkdir nullb0 &&
    cd nullb0 &&
    echo 0 > completion_nsec &&
    echo 512 > blocksize &&
    echo 0 > home_node &&
    echo 0 > irqmode &&
    echo 1024 > size &&
    echo 0 > memory_backed &&
    echo 2 > queue_mode &&
    echo 1 > power || exit $?
(cd /sys/block/nullb0/queue && echo 2 > rq_affinity) || exit $?
iodepth=${1:-1}
runtime=30
args=()
if [ "$iodepth" = 1 ]; then
    args+=(--ioengine=psync)
else
    args+=(--ioengine=io_uring --iodepth_batch=$((iodepth/2)))
fi
args+=(--iodepth=$iodepth --name=nullb0 --filename=/dev/nullb0 \
    --rw=read --bs=512 --loops=$((1<<20)) --direct=1 --numjobs=$numcpus \
    --thread --runtime=$runtime --invalidate=1 --gtod_reduce=1 \
    --group_reporting=1 --ioscheduler=mq-deadline)
if numactl -m 0 -N 0 echo >&/dev/null; then
    numactl -m 0 -N 0 -- fio "${args[@]}"
else
    fio "${args[@]}"
fi
On 8/26/21 5:49 PM, Bart Van Assche wrote:
> On 8/26/21 11:45 AM, Jens Axboe wrote:
>> Just ran a quick test here, and I go from 3.55M IOPS to 1.23M switching
>> to deadline, of which 37% of the overhead is from dd_dispatch().
>>
>> With the posted patch applied, it runs at 2.3M IOPS with mq-deadline,
>> which is a lot better. This is on my 3970X test box, so 32 cores, 64
>> threads.
>
> Hi Jens,
>
> With the script below, queue depth >= 2 and an improved version of
> Zhen's patch I see 970 K IOPS with the mq-deadline scheduler in an
> 8 core VM (i7-4790 CPU). In other words, more IOPS than what Zhen
> reported with fewer CPU cores. Is that good enough?

That depends, what kind of IOPS are you getting if you revert the
original change?
On 8/26/21 4:51 PM, Jens Axboe wrote:
> On 8/26/21 5:49 PM, Bart Van Assche wrote:
>> On 8/26/21 11:45 AM, Jens Axboe wrote:
>>> Just ran a quick test here, and I go from 3.55M IOPS to 1.23M switching
>>> to deadline, of which 37% of the overhead is from dd_dispatch().
>>>
>>> With the posted patch applied, it runs at 2.3M IOPS with mq-deadline,
>>> which is a lot better. This is on my 3970X test box, so 32 cores, 64
>>> threads.
>>
>> Hi Jens,
>>
>> With the script below, queue depth >= 2 and an improved version of
>> Zhen's patch I see 970 K IOPS with the mq-deadline scheduler in an
>> 8 core VM (i7-4790 CPU). In other words, more IOPS than what Zhen
>> reported with fewer CPU cores. Is that good enough?
>
> That depends, what kind of IOPS are you getting if you revert the
> original change?

Hi Jens,

Here is an overview of the tests I ran so far, all on the same test
setup:
* No I/O scheduler: about 5630 K IOPS.
* Kernel v5.11 + mq-deadline: about 1100 K IOPS.
* block-for-next + mq-deadline: about 760 K IOPS.
* block-for-next with improved mq-deadline performance: about 970 K IOPS.

Bart.
On 8/26/21 6:03 PM, Bart Van Assche wrote:
> On 8/26/21 4:51 PM, Jens Axboe wrote:
>> On 8/26/21 5:49 PM, Bart Van Assche wrote:
>>> With the script below, queue depth >= 2 and an improved version of
>>> Zhen's patch I see 970 K IOPS with the mq-deadline scheduler in an
>>> 8 core VM (i7-4790 CPU). In other words, more IOPS than what Zhen
>>> reported with fewer CPU cores. Is that good enough?
>>
>> That depends, what kind of IOPS are you getting if you revert the
>> original change?
>
> Hi Jens,
>
> Here is an overview of the tests I ran so far, all on the same test
> setup:
> * No I/O scheduler: about 5630 K IOPS.
> * Kernel v5.11 + mq-deadline: about 1100 K IOPS.
> * block-for-next + mq-deadline: about 760 K IOPS.
> * block-for-next with improved mq-deadline performance: about 970 K IOPS.

So we're still off by about 12%, I don't think that is good enough.
That's assuming that v5.11 + mq-deadline is the same as for-next with
the mq-deadline change reverted? Because that would be the key number to
compare it with.
On 8/26/21 5:05 PM, Jens Axboe wrote:
>> Here is an overview of the tests I ran so far, all on the same test
>> setup:
>> * No I/O scheduler: about 5630 K IOPS.
>> * Kernel v5.11 + mq-deadline: about 1100 K IOPS.
>> * block-for-next + mq-deadline: about 760 K IOPS.
>> * block-for-next with improved mq-deadline performance: about 970 K IOPS.
>
> So we're still off by about 12%, I don't think that is good enough.
> That's assuming that v5.11 + mq-deadline is the same as for-next with
> the mq-deadline change reverted? Because that would be the key number to
> compare it with.

A quick attempt to eliminate the ktime_get_ns() call from
__dd_dispatch_request() improved performance a few percent but not as
much as I was hoping. I need a few days of time to run these
measurements, to optimize performance further and to rerun all
functional and performance tests.

Thanks,

Bart.
On Thu, 2021-08-26 at 22:40 +0800, Zhen Lei wrote:
> dd_queued() traverses the percpu variable for summation. The more cores,
> the higher the performance overhead. I currently have a 128-core board and
> this function takes 2.5 us. If the number of high-priority requests is
> small and the number of low- and medium-priority requests is large, the
> performance impact is significant.
>
> Let's maintain a non-percpu member variable 'nr_queued', which is
> incremented by 1 immediately following "inserted++" and decremented by 1
> immediately following "completed++". Because both the judgment dd_queued()
> in dd_dispatch_request() and operation "inserted++" in dd_insert_request()
> are protected by dd->lock, lock protection needs to be added only in
> dd_finish_request(), which is unlikely to cause significant performance
> side effects.
>
> Tested on my 128-core board with two ssd disks.
> fio bs=4k rw=read iodepth=128 cpus_allowed=0-95 <others>
> Before:
>   [183K/0/0 iops]
>   [172K/0/0 iops]
>
> After:
>   [258K/0/0 iops]
>   [258K/0/0 iops]
>
> Fixes: fb926032b320 ("block/mq-deadline: Prioritize high-priority requests")
> Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
> ---
>  block/mq-deadline.c | 14 +++++++++-----
>  1 file changed, 9 insertions(+), 5 deletions(-)
>
> diff --git a/block/mq-deadline.c b/block/mq-deadline.c
> index a09761cbdf12e58..d8f6aa12de80049 100644
> --- a/block/mq-deadline.c
> +++ b/block/mq-deadline.c
> @@ -79,6 +79,7 @@ struct dd_per_prio {
>  	struct list_head fifo_list[DD_DIR_COUNT];
>  	/* Next request in FIFO order. Read, write or both are NULL. */
>  	struct request *next_rq[DD_DIR_COUNT];
> +	unsigned int nr_queued;
>  };
>
>  struct deadline_data {
> @@ -277,9 +278,9 @@ deadline_move_request(struct deadline_data *dd, struct dd_per_prio *per_prio,
>  }
>
>  /* Number of requests queued for a given priority level. */
> -static u32 dd_queued(struct deadline_data *dd, enum dd_prio prio)
> +static __always_inline u32 dd_queued(struct deadline_data *dd, enum dd_prio prio)
>  {
> -	return dd_sum(dd, inserted, prio) - dd_sum(dd, completed, prio);
> +	return dd->per_prio[prio].nr_queued;
>  }
>
>  /*
> @@ -711,6 +712,8 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
>
>  	prio = ioprio_class_to_prio[ioprio_class];
>  	dd_count(dd, inserted, prio);
> +	per_prio = &dd->per_prio[prio];
> +	per_prio->nr_queued++;
>
>  	if (blk_mq_sched_try_insert_merge(q, rq, &free)) {
>  		blk_mq_free_requests(&free);
> @@ -719,7 +722,6 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
>
>  	trace_block_rq_insert(rq);
>
> -	per_prio = &dd->per_prio[prio];
>  	if (at_head) {
>  		list_add(&rq->queuelist, &per_prio->dispatch);
>  	} else {
> @@ -790,12 +792,14 @@ static void dd_finish_request(struct request *rq)
>  	const u8 ioprio_class = dd_rq_ioclass(rq);
>  	const enum dd_prio prio = ioprio_class_to_prio[ioprio_class];
>  	struct dd_per_prio *per_prio = &dd->per_prio[prio];
> +	unsigned long flags;
>
>  	dd_count(dd, completed, prio);
> +	spin_lock_irqsave(&dd->lock, flags);
> +	per_prio->nr_queued--;
> +	spin_unlock_irqrestore(&dd->lock, flags);

dd->lock is not taken with irqsave everywhere else. This leads to hard
lockups which I hit right away on boot. To avoid this, we need a
spin_lock_irqsave() everywhere.

Of note is that without this patch, testing on nullblk with Bart's
script on 5.14.0-rc7, I get this splat:

[  198.726920] watchdog: BUG: soft lockup - CPU#20 stuck for 26s! [kworker/20:1H:260]
[  198.734550] Modules linked in: null_blk rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache netfs nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables libcrc32c nfnetlink sunrpc vfat fat iTCO_wdt iTCO_vendor_support ipmi_ssif x86_pkg_temp_thermal acpi_ipmi coretemp ipmi_si ioatdma i2c_i801 bfq i2c_smbus lpc_ich intel_pch_thermal dca ipmi_devintf ipmi_msghandler acpi_power_meter fuse ip_tables sd_mod ast i2c_algo_bit drm_vram_helper drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm_ttm_helper ttm drm i40e crct10dif_pclmul mpt3sas crc32_pclmul ahci ghash_clmulni_intel libahci libata raid_class scsi_transport_sas pkcs8_key_parser
[  198.805375] irq event stamp: 25378690
[  198.809063] hardirqs last enabled at (25378689): [<ffffffff81149959>] ktime_get+0x109/0x120
[  198.817545] hardirqs last disabled at (25378690): [<ffffffff8190519b>] sysvec_apic_timer_interrupt+0xb/0x90
[  198.827327] softirqs last enabled at (25337302): [<ffffffff810b331f>] __irq_exit_rcu+0xbf/0xe0
[  198.836066] softirqs last disabled at (25337297): [<ffffffff810b331f>] __irq_exit_rcu+0xbf/0xe0
[  198.844802] CPU: 20 PID: 260 Comm: kworker/20:1H Not tainted 5.14.0-rc7+ #1324
[  198.852059] Hardware name: Supermicro Super Server/X11DPL-i, BIOS 3.3 02/21/2020
[  198.859487] Workqueue: kblockd blk_mq_run_work_fn
[  198.864222] RIP: 0010:__list_add_valid+0x33/0x40
[  198.868868] Code: f2 0f 85 ec 44 44 00 4c 8b 0a 4d 39 c1 0f 85 08 45 44 00 48 39 d7 0f 84 e8 44 44 00 4c 39 cf 0f 84 df 44 44 00 b8 01 00 00 00 <c3> 66 66 2e 0f 1f 84 00 00 00 00 00 90 48 8b 17 4c 8b 47 08 48 b8
[  198.887712] RSP: 0018:ffff8883f1337d68 EFLAGS: 00000206
[  198.892963] RAX: 0000000000000001 RBX: ffff88857dae0840 RCX: 0000000000000000
[  198.900132] RDX: ffff8885387e2bc8 RSI: ffff8885387e2bc8 RDI: ffff8885387e2d48
[  198.907300] RBP: ffff8883f1337d90 R08: ffff8883f1337d90 R09: ffff8883f1337d90
[  198.914467] R10: 0000000000000020 R11: 0000000000000001 R12: 000000000000000a
[  198.921632] R13: ffff88857dae0800 R14: ffff8885bd3f3400 R15: ffff8885bd276200
[  198.928801] FS:  0000000000000000(0000) GS:ffff888860100000(0000) knlGS:0000000000000000
[  198.936929] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  198.942703] CR2: 0000000002204440 CR3: 0000000107322004 CR4: 00000000007706e0
[  198.949871] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  198.957036] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  198.964203] PKRU: 55555554
[  198.966933] Call Trace:
[  198.969401]  __blk_mq_do_dispatch_sched+0x234/0x2f0
[  198.974314]  __blk_mq_sched_dispatch_requests+0xf4/0x140
[  198.979662]  blk_mq_sched_dispatch_requests+0x30/0x60
[  198.984744]  __blk_mq_run_hw_queue+0x49/0x90
[  198.989041]  process_one_work+0x26c/0x570
[  198.993083]  worker_thread+0x55/0x3c0
[  198.996776]  ? process_one_work+0x570/0x570
[  199.000993]  kthread+0x140/0x160
[  199.004243]  ? set_kthread_struct+0x40/0x40
[  199.008452]  ret_from_fork+0x1f/0x30

>
>  	if (blk_queue_is_zoned(q)) {
> -		unsigned long flags;
> -
>  		spin_lock_irqsave(&dd->zone_lock, flags);
>  		blk_req_zone_write_unlock(rq);
>  		if (!list_empty(&per_prio->fifo_list[DD_WRITE]))
On 8/26/21 5:05 PM, Jens Axboe wrote:
> On 8/26/21 6:03 PM, Bart Van Assche wrote:
>> Here is an overview of the tests I ran so far, all on the same test
>> setup:
>> * No I/O scheduler: about 5630 K IOPS.
>> * Kernel v5.11 + mq-deadline: about 1100 K IOPS.
>> * block-for-next + mq-deadline: about 760 K IOPS.
>> * block-for-next with improved mq-deadline performance: about 970 K IOPS.
>
> So we're still off by about 12%, I don't think that is good enough.
> That's assuming that v5.11 + mq-deadline is the same as for-next with
> the mq-deadline change reverted? Because that would be the key number to
> compare it with.

With the patch series that is available at
https://github.com/bvanassche/linux/tree/block-for-next the same test
reports 1090 K IOPS, or only 1% below the v5.11 result. I will post that
series on the linux-block mailing list after I have finished testing it.

Bart.
On 8/26/21 8:48 PM, Bart Van Assche wrote:
> On 8/26/21 5:05 PM, Jens Axboe wrote:
>> On 8/26/21 6:03 PM, Bart Van Assche wrote:
>>> Here is an overview of the tests I ran so far, all on the same test
>>> setup:
>>> * No I/O scheduler: about 5630 K IOPS.
>>> * Kernel v5.11 + mq-deadline: about 1100 K IOPS.
>>> * block-for-next + mq-deadline: about 760 K IOPS.
>>> * block-for-next with improved mq-deadline performance: about 970 K IOPS.
>>
>> So we're still off by about 12%, I don't think that is good enough.
>> That's assuming that v5.11 + mq-deadline is the same as for-next with
>> the mq-deadline change reverted? Because that would be the key number to
>> compare it with.
>
> With the patch series that is available at
> https://github.com/bvanassche/linux/tree/block-for-next the same test
> reports 1090 K IOPS, or only 1% below the v5.11 result. I will post that
> series on the linux-block mailing list after I have finished testing it.

OK sounds good. I do think we should just do the revert at this point,
any real fix is going to end up being bigger than I'd like at this
point. Then we can re-introduce the feature once we're happy with the
results.
On 2021/08/27 12:13, Jens Axboe wrote:
> On 8/26/21 8:48 PM, Bart Van Assche wrote:
>> On 8/26/21 5:05 PM, Jens Axboe wrote:
>>> On 8/26/21 6:03 PM, Bart Van Assche wrote:
>>>> Here is an overview of the tests I ran so far, all on the same test
>>>> setup:
>>>> * No I/O scheduler: about 5630 K IOPS.
>>>> * Kernel v5.11 + mq-deadline: about 1100 K IOPS.
>>>> * block-for-next + mq-deadline: about 760 K IOPS.
>>>> * block-for-next with improved mq-deadline performance: about 970 K IOPS.
>>>
>>> So we're still off by about 12%, I don't think that is good enough.
>>> That's assuming that v5.11 + mq-deadline is the same as for-next with
>>> the mq-deadline change reverted? Because that would be the key number to
>>> compare it with.
>>
>> With the patch series that is available at
>> https://github.com/bvanassche/linux/tree/block-for-next the same test
>> reports 1090 K IOPS, or only 1% below the v5.11 result. I will post that
>> series on the linux-block mailing list after I have finished testing it.
>
> OK sounds good. I do think we should just do the revert at this point,
> any real fix is going to end up being bigger than I'd like at this
> point. Then we can re-introduce the feature once we're happy with the
> results.

FYI, here is what I get with Bart's test script running on a dual socket
8-cores/16-threads Xeon machine (32 CPUs total):

* 5.14.0-rc7, with fb926032b320 reverted:
-----------------------------------------
QD 1: IOPS=305k (*)
QD 2: IOPS=411k
QD 4: IOPS=408k
QD 8: IOPS=414k

* 5.14.0-rc7, current (no modification):
----------------------------------------
QD 1: IOPS=296k (*)
QD 2: IOPS=207k
QD 4: IOPS=208k
QD 8: IOPS=210k

* 5.14.0-rc7, with modified patch (attached to this email):
-----------------------------------------------------------
QD 1: IOPS=287k (*)
QD 2: IOPS=334k
QD 4: IOPS=330k
QD 8: IOPS=334k

For reference, with the same test script using the none scheduler:

QD 1: IOPS=2172K
QD 2: IOPS=1075K
QD 4: IOPS=1075k
QD 8: IOPS=1077k

So the mq-deadline priority patch reduces performance by nearly half at
high QD. With the modified patch, we are back to better numbers, but
still a significant 20% drop at high QD.

(*) Note: in all cases using the mq-deadline scheduler, for the first
run at QD=1, I get this splat 100% of the time.

[   95.173889] watchdog: BUG: soft lockup - CPU#0 stuck for 26s! [kworker/0:1H:757]
[   95.181351] Modules linked in: null_blk rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache netfs nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables libcrc32c nfnetlink sunrpc vfat fat iTCO_wdt iTCO_vendor_support ipmi_ssif x86_pkg_temp_thermal coretemp i2c_i801 acpi_ipmi bfq i2c_smbus ioatdma lpc_ich ipmi_si intel_pch_thermal dca ipmi_devintf ipmi_msghandler acpi_power_meter fuse ip_tables sd_mod ast i2c_algo_bit drm_vram_helper drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm_ttm_helper ttm drm i40e crct10dif_pclmul mpt3sas crc32_pclmul ahci ghash_clmulni_intel libahci raid_class scsi_transport_sas libata pkcs8_key_parser
[   95.252173] irq event stamp: 30500990
[   95.255860] hardirqs last enabled at (30500989): [<ffffffff81910e2d>] _raw_spin_unlock_irqrestore+0x2d/0x40
[   95.265735] hardirqs last disabled at (30500990): [<ffffffff819050cb>] sysvec_apic_timer_interrupt+0xb/0x90
[   95.275520] softirqs last enabled at (30496338): [<ffffffff810b331f>] __irq_exit_rcu+0xbf/0xe0
[   95.284259] softirqs last disabled at (30496333): [<ffffffff810b331f>] __irq_exit_rcu+0xbf/0xe0
[   95.292994] CPU: 0 PID: 757 Comm: kworker/0:1H Not tainted 5.14.0-rc7+ #1334
[   95.300076] Hardware name: Supermicro Super Server/X11DPL-i, BIOS 3.3 02/21/2020
[   95.307504] Workqueue: kblockd blk_mq_run_work_fn
[   95.312243] RIP: 0010:_raw_spin_unlock_irqrestore+0x35/0x40
[   95.317844] Code: c7 18 53 48 89 f3 48 8b 74 24 10 e8 35 82 80 ff 48 89 ef e8 9d ac 80 ff 80 e7 02 74 06 e8 23 33 8b ff fb 65 ff 0d 8b 5f 70 7e <5b> 5d c3 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 fd 65 ff
[   95.336680] RSP: 0018:ffff888448cefbb0 EFLAGS: 00000202
[   95.341934] RAX: 0000000001d1687d RBX: 0000000000000287 RCX: 0000000000000006
[   95.349103] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff81910e2d
[   95.356270] RBP: ffff888192649218 R08: 0000000000000001 R09: 0000000000000001
[   95.363437] R10: 0000000000000000 R11: 000000000000005c R12: 0000000000000000
[   95.370604] R13: 0000000000000287 R14: ffff888192649218 R15: ffff88885fe68e80
[   95.377771] FS:  0000000000000000(0000) GS:ffff88885fe00000(0000) knlGS:0000000000000000
[   95.385901] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   95.391675] CR2: 00007f59bfe71f80 CR3: 000000074a91e005 CR4: 00000000007706f0
[   95.398842] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   95.406009] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   95.413176] PKRU: 55555554
[   95.415904] Call Trace:
[   95.418373]  try_to_wake_up+0x268/0x7c0
[   95.422238]  blk_update_request+0x25b/0x420
[   95.426452]  blk_mq_end_request+0x1c/0x120
[   95.430576]  null_handle_cmd+0x12d/0x270 [null_blk]
[   95.435485]  blk_mq_dispatch_rq_list+0x13c/0x7f0
[   95.440130]  ? sbitmap_get+0x86/0x190
[   95.443826]  __blk_mq_do_dispatch_sched+0xb5/0x2f0
[   95.448653]  __blk_mq_sched_dispatch_requests+0xf4/0x140
[   95.453998]  blk_mq_sched_dispatch_requests+0x30/0x60
[   95.459083]  __blk_mq_run_hw_queue+0x49/0x90
[   95.463377]  process_one_work+0x26c/0x570
[   95.467421]  worker_thread+0x55/0x3c0
[   95.471103]  ? process_one_work+0x570/0x570
[   95.475313]  kthread+0x140/0x160
[   95.478567]  ? set_kthread_struct+0x40/0x40
[   95.482774]  ret_from_fork+0x1f/0x30
On 8/26/21 9:49 PM, Damien Le Moal wrote:
> So the mq-deadline priority patch reduces performance by nearly half at
> high QD. With the modified patch, we are back to better numbers, but
> still a significant 20% drop at high QD.

Hi Damien,

An implementation of I/O priority for the deadline scheduler that
reduces the IOPS drop to 1% on my test setup is available here:
https://github.com/bvanassche/linux/tree/block-for-next

> (*) Note: in all cases using the mq-deadline scheduler, for the first
> run at QD=1, I get this splat 100% of the time.
>
> [   95.173889] watchdog: BUG: soft lockup - CPU#0 stuck for 26s! [kworker/0:1H:757]
> [   95.292994] CPU: 0 PID: 757 Comm: kworker/0:1H Not tainted 5.14.0-rc7+ #1334
> [   95.307504] Workqueue: kblockd blk_mq_run_work_fn
> [   95.312243] RIP: 0010:_raw_spin_unlock_irqrestore+0x35/0x40
> [   95.415904] Call Trace:
> [   95.418373]  try_to_wake_up+0x268/0x7c0
> [   95.422238]  blk_update_request+0x25b/0x420
> [   95.426452]  blk_mq_end_request+0x1c/0x120
> [   95.430576]  null_handle_cmd+0x12d/0x270 [null_blk]
> [   95.435485]  blk_mq_dispatch_rq_list+0x13c/0x7f0
> [   95.443826]  __blk_mq_do_dispatch_sched+0xb5/0x2f0
> [   95.448653]  __blk_mq_sched_dispatch_requests+0xf4/0x140
> [   95.453998]  blk_mq_sched_dispatch_requests+0x30/0x60
> [   95.459083]  __blk_mq_run_hw_queue+0x49/0x90
> [   95.463377]  process_one_work+0x26c/0x570
> [   95.467421]  worker_thread+0x55/0x3c0
> [   95.475313]  kthread+0x140/0x160
> [   95.482774]  ret_from_fork+0x1f/0x30

I don't see any function names in the above call stack that refer to the
mq-deadline scheduler? Did I perhaps overlook something? Anyway, if you
can tell me how to reproduce this (kernel commit + kernel config), I
will take a look.

Thanks,

Bart.
On 2021/8/27 11:13, Jens Axboe wrote:
> On 8/26/21 8:48 PM, Bart Van Assche wrote:
>> On 8/26/21 5:05 PM, Jens Axboe wrote:
>>> On 8/26/21 6:03 PM, Bart Van Assche wrote:
>>>> Here is an overview of the tests I ran so far, all on the same test
>>>> setup:
>>>> * No I/O scheduler: about 5630 K IOPS.
>>>> * Kernel v5.11 + mq-deadline: about 1100 K IOPS.
>>>> * block-for-next + mq-deadline: about 760 K IOPS.
>>>> * block-for-next with improved mq-deadline performance: about 970 K IOPS.
>>>
>>> So we're still off by about 12%, I don't think that is good enough.
>>> That's assuming that v5.11 + mq-deadline is the same as for-next with
>>> the mq-deadline change reverted? Because that would be the key number to
>>> compare it with.
>>
>> With the patch series that is available at
>> https://github.com/bvanassche/linux/tree/block-for-next the same test
>> reports 1090 K IOPS, or only 1% below the v5.11 result. I will post that
>> series on the linux-block mailing list after I have finished testing it.
>
> OK sounds good. I do think we should just do the revert at this point,
> any real fix is going to end up being bigger than I'd like at this
> point. Then we can re-introduce the feature once we're happy with the
> results.

Yes, it's already rc7, which is no longer a good time for big changes.
The revert is the best solution, and applying my patch is a compromise
solution.
On 2021/8/27 2:09, Bart Van Assche wrote:
> On 8/26/21 7:40 AM, Zhen Lei wrote:
>> lock protection needs to be added only in dd_finish_request(), which
>> is unlikely to cause significant performance side effects.
>
> Not sure the above is correct. Every new atomic instruction has a
> measurable performance overhead. But I guess in this case that overhead
> is smaller than the time needed to sum 128 per-CPU variables.
>
>> Tested on my 128-core board with two ssd disks.
>> fio bs=4k rw=read iodepth=128 cpus_allowed=0-95 <others>
>> Before:
>>   [183K/0/0 iops]
>>   [172K/0/0 iops]
>>
>> After:
>>   [258K/0/0 iops]
>>   [258K/0/0 iops]
>
> Nice work!
>
>> Fixes: fb926032b320 ("block/mq-deadline: Prioritize high-priority requests")
>
> Shouldn't the Fixes: tag be used only for patches that modify
> functionality? I'm not sure it is appropriate to use this tag for
> performance improvements.
>
>>  struct deadline_data {
>> @@ -277,9 +278,9 @@ deadline_move_request(struct deadline_data *dd, struct dd_per_prio *per_prio,
>>  }
>>
>>  /* Number of requests queued for a given priority level. */
>> -static u32 dd_queued(struct deadline_data *dd, enum dd_prio prio)
>> +static __always_inline u32 dd_queued(struct deadline_data *dd, enum dd_prio prio)
>>  {
>> -	return dd_sum(dd, inserted, prio) - dd_sum(dd, completed, prio);
>> +	return dd->per_prio[prio].nr_queued;
>>  }
>
> Please leave out "__always_inline". Modern compilers are smart enough to
> inline this function without using the "inline" keyword.

Yes.

>
>> @@ -711,6 +712,8 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
>>
>>  	prio = ioprio_class_to_prio[ioprio_class];
>>  	dd_count(dd, inserted, prio);
>> +	per_prio = &dd->per_prio[prio];
>> +	per_prio->nr_queued++;
>>
>>  	if (blk_mq_sched_try_insert_merge(q, rq, &free)) {
>>  		blk_mq_free_requests(&free);
>
> I think the above is wrong - nr_queued should not be incremented if the
> request is merged into another request. Please move the code that
> increments nr_queued past the above if-statement.

Does dd_count(dd, inserted, prio) need to be moved past the
if-statement as well?

> Thanks,
>
> Bart.
On 2021/8/27 10:30, Damien Le Moal wrote:
> On Thu, 2021-08-26 at 22:40 +0800, Zhen Lei wrote:
>> dd_queued() traverses the percpu variable for summation. The more cores,
>> the higher the performance overhead. I currently have a 128-core board and
>> this function takes 2.5 us. If the number of high-priority requests is
>> small and the number of low- and medium-priority requests is large, the
>> performance impact is significant.
>>
>> Let's maintain a non-percpu member variable 'nr_queued', which is
>> incremented by 1 immediately following "inserted++" and decremented by 1
>> immediately following "completed++". Because both the judgment dd_queued()
>> in dd_dispatch_request() and operation "inserted++" in dd_insert_request()
>> are protected by dd->lock, lock protection needs to be added only in
>> dd_finish_request(), which is unlikely to cause significant performance
>> side effects.
>>
>> Tested on my 128-core board with two ssd disks.
>> fio bs=4k rw=read iodepth=128 cpus_allowed=0-95 <others>
>> Before:
>>   [183K/0/0 iops]
>>   [172K/0/0 iops]
>>
>> After:
>>   [258K/0/0 iops]
>>   [258K/0/0 iops]
>>
>> Fixes: fb926032b320 ("block/mq-deadline: Prioritize high-priority requests")
>> Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
>> ---
>>  block/mq-deadline.c | 14 +++++++++-----
>>  1 file changed, 9 insertions(+), 5 deletions(-)
>>
>> diff --git a/block/mq-deadline.c b/block/mq-deadline.c
>> index a09761cbdf12e58..d8f6aa12de80049 100644
>> --- a/block/mq-deadline.c
>> +++ b/block/mq-deadline.c
>> @@ -79,6 +79,7 @@ struct dd_per_prio {
>>  	struct list_head fifo_list[DD_DIR_COUNT];
>>  	/* Next request in FIFO order. Read, write or both are NULL. */
>>  	struct request *next_rq[DD_DIR_COUNT];
>> +	unsigned int nr_queued;
>>  };
>>
>>  struct deadline_data {
>> @@ -277,9 +278,9 @@ deadline_move_request(struct deadline_data *dd, struct dd_per_prio *per_prio,
>>  }
>>
>>  /* Number of requests queued for a given priority level. */
>> -static u32 dd_queued(struct deadline_data *dd, enum dd_prio prio)
>> +static __always_inline u32 dd_queued(struct deadline_data *dd, enum dd_prio prio)
>>  {
>> -	return dd_sum(dd, inserted, prio) - dd_sum(dd, completed, prio);
>> +	return dd->per_prio[prio].nr_queued;
>>  }
>>
>>  /*
>> @@ -711,6 +712,8 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
>>
>>  	prio = ioprio_class_to_prio[ioprio_class];
>>  	dd_count(dd, inserted, prio);
>> +	per_prio = &dd->per_prio[prio];
>> +	per_prio->nr_queued++;
>>
>>  	if (blk_mq_sched_try_insert_merge(q, rq, &free)) {
>>  		blk_mq_free_requests(&free);
>> @@ -719,7 +722,6 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
>>
>>  	trace_block_rq_insert(rq);
>>
>> -	per_prio = &dd->per_prio[prio];
>>  	if (at_head) {
>>  		list_add(&rq->queuelist, &per_prio->dispatch);
>>  	} else {
>> @@ -790,12 +792,14 @@ static void dd_finish_request(struct request *rq)
>>  	const u8 ioprio_class = dd_rq_ioclass(rq);
>>  	const enum dd_prio prio = ioprio_class_to_prio[ioprio_class];
>>  	struct dd_per_prio *per_prio = &dd->per_prio[prio];
>> +	unsigned long flags;
>>
>>  	dd_count(dd, completed, prio);
>> +	spin_lock_irqsave(&dd->lock, flags);
>> +	per_prio->nr_queued--;
>> +	spin_unlock_irqrestore(&dd->lock, flags);
>
> dd->lock is not taken with irqsave everywhere else. This leads to hard
> lockups which I hit right away on boot. To avoid this, we need a
> spin_lock_irqsave() everywhere.

Yes, it's safer to add interrupt protection; I noticed that too. But I
thought there was a convention that the upper-layer functions turn off
interrupts, so I didn't touch it.

> Of note is that without this patch, testing on nullblk with Bart's
> script on 5.14.0-rc7, I get this splat:
>
> [  198.726920] watchdog: BUG: soft lockup - CPU#20 stuck for 26s! [kworker/20:1H:260]
> [  198.734550] Modules linked in: null_blk rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache netfs nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables libcrc32c nfnetlink sunrpc vfat fat iTCO_wdt iTCO_vendor_support ipmi_ssif x86_pkg_temp_thermal acpi_ipmi coretemp ipmi_si ioatdma i2c_i801 bfq i2c_smbus lpc_ich intel_pch_thermal dca ipmi_devintf ipmi_msghandler acpi_power_meter fuse ip_tables sd_mod ast i2c_algo_bit drm_vram_helper drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm_ttm_helper ttm drm i40e crct10dif_pclmul mpt3sas crc32_pclmul ahci ghash_clmulni_intel libahci libata raid_class scsi_transport_sas pkcs8_key_parser
> [  198.805375] irq event stamp: 25378690
> [  198.809063] hardirqs last enabled at (25378689): [<ffffffff81149959>] ktime_get+0x109/0x120
> [  198.817545] hardirqs last disabled at (25378690): [<ffffffff8190519b>] sysvec_apic_timer_interrupt+0xb/0x90
> [  198.827327] softirqs last enabled at (25337302): [<ffffffff810b331f>] __irq_exit_rcu+0xbf/0xe0
> [  198.836066] softirqs last disabled at (25337297): [<ffffffff810b331f>] __irq_exit_rcu+0xbf/0xe0
> [  198.844802] CPU: 20 PID: 260 Comm: kworker/20:1H Not tainted 5.14.0-rc7+ #1324
> [  198.852059] Hardware name: Supermicro Super Server/X11DPL-i, BIOS 3.3 02/21/2020
> [  198.859487] Workqueue: kblockd blk_mq_run_work_fn
> [  198.864222] RIP: 0010:__list_add_valid+0x33/0x40
> [  198.868868] Code: f2 0f 85 ec 44 44 00 4c 8b 0a 4d 39 c1 0f 85 08 45 44 00 48 39 d7 0f 84 e8 44 44 00 4c 39 cf 0f 84 df 44 44 00 b8 01 00 00 00 <c3> 66 66 2e 0f 1f 84 00 00 00 00 00 90 48 8b 17 4c 8b 47 08 48 b8
> [  198.887712] RSP: 0018:ffff8883f1337d68 EFLAGS: 00000206
> [  198.892963] RAX: 0000000000000001 RBX: ffff88857dae0840 RCX: 0000000000000000
> [  198.900132] RDX: ffff8885387e2bc8 RSI:
ffff8885387e2bc8 RDI: > ffff8885387e2d48 > [ 198.907300] RBP: ffff8883f1337d90 R08: ffff8883f1337d90 R09: > ffff8883f1337d90 > [ 198.914467] R10: 0000000000000020 R11: 0000000000000001 R12: > 000000000000000a > [ 198.921632] R13: ffff88857dae0800 R14: ffff8885bd3f3400 R15: > ffff8885bd276200 > [ 198.928801] FS: 0000000000000000(0000) GS:ffff888860100000(0000) > knlGS:0000000000000000 > [ 198.936929] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [ 198.942703] CR2: 0000000002204440 CR3: 0000000107322004 CR4: > 00000000007706e0 > [ 198.949871] DR0: 0000000000000000 DR1: 0000000000000000 DR2: > 0000000000000000 > [ 198.957036] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: > 0000000000000400 > [ 198.964203] PKRU: 55555554 > [ 198.966933] Call Trace: > [ 198.969401] __blk_mq_do_dispatch_sched+0x234/0x2f0 > [ 198.974314] __blk_mq_sched_dispatch_requests+0xf4/0x140 > [ 198.979662] blk_mq_sched_dispatch_requests+0x30/0x60 > [ 198.984744] __blk_mq_run_hw_queue+0x49/0x90 > [ 198.989041] process_one_work+0x26c/0x570 > [ 198.993083] worker_thread+0x55/0x3c0 > [ 198.996776] ? process_one_work+0x570/0x570 > [ 199.000993] kthread+0x140/0x160 > [ 199.004243] ? set_kthread_struct+0x40/0x40 > [ 199.008452] ret_from_fork+0x1f/0x30 > > > > >> >> if (blk_queue_is_zoned(q)) { >> - unsigned long flags; >> - >> spin_lock_irqsave(&dd->zone_lock, flags); >> blk_req_zone_write_unlock(rq); >> if (!list_empty(&per_prio->fifo_list[DD_WRITE])) >
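The hard-lockup risk Damien and Leizhen discuss above — taking dd->lock with plain spin_lock() in dispatch context while a completion path can take the same lock from hard-irq context — can be modeled with a small userspace sketch. Everything here is illustrative (no kernel API, hypothetical names); it only demonstrates why disabling interrupts while holding the lock avoids the self-deadlock:

```c
/* Toy single-CPU model: an "interrupt" that needs the lock arriving
 * while the same CPU already holds it spins forever, unless the lock
 * holder disabled interrupts first (the spin_lock_irqsave() pattern). */
#include <assert.h>
#include <stdbool.h>

static bool lock_held;
static bool irqs_enabled = true;

/* Completion path running in hard-irq context; needs the lock. */
static bool irq_completion_path(void)
{
    if (lock_held)
        return false;   /* same CPU holds it: spins forever (lockup) */
    lock_held = true;
    lock_held = false;
    return true;
}

/* Dispatch path. use_irqsave models spin_lock_irqsave() vs spin_lock().
 * Returns false if the simulated CPU locks up. */
static bool dispatch(bool use_irqsave, bool irq_arrives_midway)
{
    bool saved = irqs_enabled;

    if (use_irqsave)
        irqs_enabled = false;       /* local_irq_save() part */
    lock_held = true;
    if (irq_arrives_midway && irqs_enabled) {
        if (!irq_completion_path())
            return false;           /* never releases: hard lockup */
    }
    lock_held = false;
    irqs_enabled = saved;
    return true;
}
```

This matches Damien's observation that dd->lock must be taken with irqsave everywhere once any acquirer can run in interrupt context.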
On 8/27/21 6:45 PM, Leizhen (ThunderTown) wrote: > On 2021/8/27 11:13, Jens Axboe wrote: >> On 8/26/21 8:48 PM, Bart Van Assche wrote: >>> With the patch series that is available at >>> https://github.com/bvanassche/linux/tree/block-for-next the same test reports >>> 1090 K IOPS or only 1% below the v5.11 result. I will post that series on the >>> linux-block mailing list after I have finished testing that series. >> >> OK sounds good. I do think we should just do the revert at this point, >> any real fix is going to end up being bigger than I'd like at this >> point. Then we can re-introduce the feature once we're happy with the >> results. > > Yes, It's already rc7 and it's no longer good for big changes. Revert is the > best solution, and apply my patch is a compromise solution. Please take a look at the patch series that is available at https://github.com/bvanassche/linux/tree/block-for-next. Performance for that patch series is significantly better than with your patch. Thanks, Bart.
On 8/27/21 6:59 PM, Leizhen (ThunderTown) wrote: >>> @@ -711,6 +712,8 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq, >>> >>> prio = ioprio_class_to_prio[ioprio_class]; >>> dd_count(dd, inserted, prio); >>> + per_prio = &dd->per_prio[prio]; >>> + per_prio->nr_queued++; >>> >>> if (blk_mq_sched_try_insert_merge(q, rq, &free)) { >>> blk_mq_free_requests(&free); >> >> I think the above is wrong - nr_queued should not be incremented if the >> request is merged into another request. Please move the code that increments >> nr_queued past the above if-statement. > > So dd_count(dd, inserted, prio) needs to be moved behind "if-statement" as well? dd_insert_request() is called if a request is inserted and also if it is requeued. dd_finish_request() is called once per request. Keeping dd_count() before blk_mq_sched_try_insert_merge() is fine since blk_mq_free_requests() will call dd_finish_request() indirectly if a request is merged. However, dd_count() must only happen once per request and must not be used if a request is requeued. Additionally, since dd_insert_request() is called with dd->lock held and since dd_finish_request() is called directly from inside dd_insert_request() if a request is merged, acquiring dd->lock from inside dd_finish_request() may trigger a deadlock. A convenient way to trigger this code path is by running test block/015 from https://github.com/osandov/blktests/. Bart.
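The ordering Bart asks for — count a request only after the merge attempt fails, so a merged request never inflates nr_queued — can be sketched in userspace. Names mirror the patch, but this is a hypothetical model, not the kernel code; "merged" stands in for blk_mq_sched_try_insert_merge() succeeding:

```c
#include <assert.h>
#include <stdbool.h>

struct dd_per_prio {
    unsigned int nr_queued;
};

/* Models dd_finish_request(): runs once per request that was queued. */
static void finish_request(struct dd_per_prio *pp)
{
    pp->nr_queued--;
}

/* Models dd_insert_request() with the increment moved past the merge
 * check, as suggested in the review. A merged request is absorbed into
 * another request and is never queued, so it must not be counted —
 * otherwise the decrement in the completion path would underflow the
 * counter (or, in the kernel, require re-taking dd->lock while it is
 * already held, the deadlock blktests block/015 exposes). */
static void insert_request(struct dd_per_prio *pp, bool merged)
{
    if (merged)
        return;
    pp->nr_queued++;
}
```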
On 2021/8/28 10:19, Bart Van Assche wrote: > On 8/27/21 6:45 PM, Leizhen (ThunderTown) wrote: >> On 2021/8/27 11:13, Jens Axboe wrote: >>> On 8/26/21 8:48 PM, Bart Van Assche wrote: >>>> With the patch series that is available at >>>> https://github.com/bvanassche/linux/tree/block-for-next the same test reports >>>> 1090 K IOPS or only 1% below the v5.11 result. I will post that series on the >>>> linux-block mailing list after I have finished testing that series. >>> >>> OK sounds good. I do think we should just do the revert at this point, >>> any real fix is going to end up being bigger than I'd like at this >>> point. Then we can re-introduce the feature once we're happy with the >>> results. >> >> Yes, It's already rc7 and it's no longer good for big changes. Revert is the >> best solution, and apply my patch is a compromise solution. > > Please take a look at the patch series that is available at > https://github.com/bvanassche/linux/tree/block-for-next. Performance for > that patch series is significantly better than with your patch. Yes, your patch series is better than mine. However, Jens prefers to avoid any risk to functional stability in v5.14, and v5.15 will need neither my patch nor the revert. I'll test your patch this afternoon; I don't have the environment set up yet. > > Thanks, > > Bart.
On 2021/8/28 10:42, Leizhen (ThunderTown) wrote: > > > On 2021/8/28 10:19, Bart Van Assche wrote: >> On 8/27/21 6:45 PM, Leizhen (ThunderTown) wrote: >>> On 2021/8/27 11:13, Jens Axboe wrote: >>>> On 8/26/21 8:48 PM, Bart Van Assche wrote: >>>>> With the patch series that is available at >>>>> https://github.com/bvanassche/linux/tree/block-for-next the same test reports >>>>> 1090 K IOPS or only 1% below the v5.11 result. I will post that series on the >>>>> linux-block mailing list after I have finished testing that series. >>>> >>>> OK sounds good. I do think we should just do the revert at this point, >>>> any real fix is going to end up being bigger than I'd like at this >>>> point. Then we can re-introduce the feature once we're happy with the >>>> results. >>> >>> Yes, It's already rc7 and it's no longer good for big changes. Revert is the >>> best solution, and apply my patch is a compromise solution. >> >> Please take a look at the patch series that is available at >> https://github.com/bvanassche/linux/tree/block-for-next. Performance for >> that patch series is significantly better than with your patch. > > Yes, this patch is better than mine. However, Jens prefers to avoid the risk of > functional stability in v5.14. v5.15 doesn't need my patch or revert. > > I'll test your patch this afternoon. I don't have the environment yet. Revert: 253K/0/0 258K/0/0 With your patch: 258K/0/0 252K/0/0 With my patch: 245K/0/0 258K/0/0 244K/0/0 I see that Jens has already pushed "revert" into v5.14-rc8. > >> >> Thanks, >> >> Bart. >> . >>
On 2021/08/27 23:34, Bart Van Assche wrote: > On 8/26/21 9:49 PM, Damien Le Moal wrote: >> So the mq-deadline priority patch reduces performance by nearly half at high QD. >> With the modified patch, we are back to better numbers, but still a significant >> 20% drop at high QD. > > Hi Damien, > > An implementation of I/O priority for the deadline scheduler that reduces the > IOPS drop to 1% on my test setup is available here: > https://github.com/bvanassche/linux/tree/block-for-next > >> (*) Note: in all cases using the mq-deadline scheduler, for the first run at >> QD=1, I get this splat 100% of the time. >> >> [ 95.173889] watchdog: BUG: soft lockup - CPU#0 stuck for 26s! [kworker/0:1H:757] >> [ 95.292994] CPU: 0 PID: 757 Comm: kworker/0:1H Not tainted 5.14.0-rc7+ #1334 >> [ 95.307504] Workqueue: kblockd blk_mq_run_work_fn >> [ 95.312243] RIP: 0010:_raw_spin_unlock_irqrestore+0x35/0x40 >> [ 95.415904] Call Trace: >> [ 95.418373] try_to_wake_up+0x268/0x7c0 >> [ 95.422238] blk_update_request+0x25b/0x420 >> [ 95.426452] blk_mq_end_request+0x1c/0x120 >> [ 95.430576] null_handle_cmd+0x12d/0x270 [null_blk] >> [ 95.435485] blk_mq_dispatch_rq_list+0x13c/0x7f0 >> [ 95.443826] __blk_mq_do_dispatch_sched+0xb5/0x2f0 >> [ 95.448653] __blk_mq_sched_dispatch_requests+0xf4/0x140 >> [ 95.453998] blk_mq_sched_dispatch_requests+0x30/0x60 >> [ 95.459083] __blk_mq_run_hw_queue+0x49/0x90 >> [ 95.463377] process_one_work+0x26c/0x570 >> [ 95.467421] worker_thread+0x55/0x3c0 >> [ 95.475313] kthread+0x140/0x160 >> [ 95.482774] ret_from_fork+0x1f/0x30 > > I don't see any function names in the above call stack that refer to the > mq-deadline scheduler? Did I perhaps overlook something? Anyway, if you can > tell me how to reproduce this (kernel commit + kernel config) I will take a > look. Indeed, the stack trace does not show any mq-deadline function. But the workqueue is stuck on _raw_spin_unlock_irqrestore() in the blk_mq_run_work_fn() function. 
I suspect that the spinlock is dd->lock, so the CPU may be stuck on entry to mq-deadline dispatch or finish request methods. Not entirely sure. I got this splat with 5.4.0-rc7 (Linus tag patch) with the attached config.
On Sun, Aug 29, 2021 at 11:02:22PM +0000, Damien Le Moal wrote: > On 2021/08/27 23:34, Bart Van Assche wrote: > > On 8/26/21 9:49 PM, Damien Le Moal wrote: > >> So the mq-deadline priority patch reduces performance by nearly half at high QD. > >> With the modified patch, we are back to better numbers, but still a significant > >> 20% drop at high QD. > > > > Hi Damien, > > > > An implementation of I/O priority for the deadline scheduler that reduces the > > IOPS drop to 1% on my test setup is available here: > > https://github.com/bvanassche/linux/tree/block-for-next > > > >> (*) Note: in all cases using the mq-deadline scheduler, for the first run at > >> QD=1, I get this splat 100% of the time. > >> > >> [ 95.173889] watchdog: BUG: soft lockup - CPU#0 stuck for 26s! [kworker/0:1H:757] > >> [ 95.292994] CPU: 0 PID: 757 Comm: kworker/0:1H Not tainted 5.14.0-rc7+ #1334 > >> [ 95.307504] Workqueue: kblockd blk_mq_run_work_fn > >> [ 95.312243] RIP: 0010:_raw_spin_unlock_irqrestore+0x35/0x40 > >> [ 95.415904] Call Trace: > >> [ 95.418373] try_to_wake_up+0x268/0x7c0 > >> [ 95.422238] blk_update_request+0x25b/0x420 > >> [ 95.426452] blk_mq_end_request+0x1c/0x120 > >> [ 95.430576] null_handle_cmd+0x12d/0x270 [null_blk] > >> [ 95.435485] blk_mq_dispatch_rq_list+0x13c/0x7f0 > >> [ 95.443826] __blk_mq_do_dispatch_sched+0xb5/0x2f0 > >> [ 95.448653] __blk_mq_sched_dispatch_requests+0xf4/0x140 > >> [ 95.453998] blk_mq_sched_dispatch_requests+0x30/0x60 > >> [ 95.459083] __blk_mq_run_hw_queue+0x49/0x90 > >> [ 95.463377] process_one_work+0x26c/0x570 > >> [ 95.467421] worker_thread+0x55/0x3c0 > >> [ 95.475313] kthread+0x140/0x160 > >> [ 95.482774] ret_from_fork+0x1f/0x30 > > > > I don't see any function names in the above call stack that refer to the > > mq-deadline scheduler? Did I perhaps overlook something? Anyway, if you can > > tell me how to reproduce this (kernel commit + kernel config) I will take a > > look. 
> > Indeed, the stack trace does not show any mq-deadline function. But the > workqueue is stuck on _raw_spin_unlock_irqrestore() in the blk_mq_run_work_fn() > function. I suspect that the spinlock is dd->lock, so the CPU may be stuck on > entry to mq-deadline dispatch or finish request methods. Not entirely sure. I don't think you can be stuck on the *unlock* part, though. In my experience, that function showing up in a soft lockup indicates you're in a broken loop that's repeatedly locking and unlocking. I haven't found anything immediately obvious in this call chain, though. > I got this splat with 5.4.0-rc7 (Linus tag patch) with the attached config. Surely 5.14.0-rc7, right?
On 8/29/21 16:02, Damien Le Moal wrote: > On 2021/08/27 23:34, Bart Van Assche wrote: >> On 8/26/21 9:49 PM, Damien Le Moal wrote: >>> So the mq-deadline priority patch reduces performance by nearly half at high QD. >>> (*) Note: in all cases using the mq-deadline scheduler, for the first run at >>> QD=1, I get this splat 100% of the time. >>> >>> [ 95.173889] watchdog: BUG: soft lockup - CPU#0 stuck for 26s! [kworker/0:1H:757] >>> [ 95.292994] CPU: 0 PID: 757 Comm: kworker/0:1H Not tainted 5.14.0-rc7+ #1334 >>> [ 95.307504] Workqueue: kblockd blk_mq_run_work_fn >>> [ 95.312243] RIP: 0010:_raw_spin_unlock_irqrestore+0x35/0x40 >>> [ 95.415904] Call Trace: >>> [ 95.418373] try_to_wake_up+0x268/0x7c0 >>> [ 95.422238] blk_update_request+0x25b/0x420 >>> [ 95.426452] blk_mq_end_request+0x1c/0x120 >>> [ 95.430576] null_handle_cmd+0x12d/0x270 [null_blk] >>> [ 95.435485] blk_mq_dispatch_rq_list+0x13c/0x7f0 >>> [ 95.443826] __blk_mq_do_dispatch_sched+0xb5/0x2f0 >>> [ 95.448653] __blk_mq_sched_dispatch_requests+0xf4/0x140 >>> [ 95.453998] blk_mq_sched_dispatch_requests+0x30/0x60 >>> [ 95.459083] __blk_mq_run_hw_queue+0x49/0x90 >>> [ 95.463377] process_one_work+0x26c/0x570 >>> [ 95.467421] worker_thread+0x55/0x3c0 >>> [ 95.475313] kthread+0x140/0x160 >>> [ 95.482774] ret_from_fork+0x1f/0x30 >> >> I don't see any function names in the above call stack that refer to the >> mq-deadline scheduler? Did I perhaps overlook something? Anyway, if you can >> tell me how to reproduce this (kernel commit + kernel config) I will take a >> look. > > Indeed, the stack trace does not show any mq-deadline function. But the > workqueue is stuck on _raw_spin_unlock_irqrestore() in the blk_mq_run_work_fn() > function. I suspect that the spinlock is dd->lock, so the CPU may be stuck on > entry to mq-deadline dispatch or finish request methods. Not entirely sure. > > I got this splat with 5.4.0-rc7 (Linus tag patch) with the attached config. 
Hi Damien, Thank you for having shared the kernel configuration used in your test. So far I have not yet been able to reproduce the above call trace in a VM. Could the above call trace have been triggered by the mpt3sas driver instead of the mq-deadline I/O scheduler? Thanks, Bart.
On 2021/08/30 11:32, Keith Busch wrote: > On Sun, Aug 29, 2021 at 11:02:22PM +0000, Damien Le Moal wrote: >> On 2021/08/27 23:34, Bart Van Assche wrote: >>> On 8/26/21 9:49 PM, Damien Le Moal wrote: >>>> So the mq-deadline priority patch reduces performance by nearly half at high QD. >>>> With the modified patch, we are back to better numbers, but still a significant >>>> 20% drop at high QD. >>> >>> Hi Damien, >>> >>> An implementation of I/O priority for the deadline scheduler that reduces the >>> IOPS drop to 1% on my test setup is available here: >>> https://github.com/bvanassche/linux/tree/block-for-next >>> >>>> (*) Note: in all cases using the mq-deadline scheduler, for the first run at >>>> QD=1, I get this splat 100% of the time. >>>> >>>> [ 95.173889] watchdog: BUG: soft lockup - CPU#0 stuck for 26s! [kworker/0:1H:757] >>>> [ 95.292994] CPU: 0 PID: 757 Comm: kworker/0:1H Not tainted 5.14.0-rc7+ #1334 >>>> [ 95.307504] Workqueue: kblockd blk_mq_run_work_fn >>>> [ 95.312243] RIP: 0010:_raw_spin_unlock_irqrestore+0x35/0x40 >>>> [ 95.415904] Call Trace: >>>> [ 95.418373] try_to_wake_up+0x268/0x7c0 >>>> [ 95.422238] blk_update_request+0x25b/0x420 >>>> [ 95.426452] blk_mq_end_request+0x1c/0x120 >>>> [ 95.430576] null_handle_cmd+0x12d/0x270 [null_blk] >>>> [ 95.435485] blk_mq_dispatch_rq_list+0x13c/0x7f0 >>>> [ 95.443826] __blk_mq_do_dispatch_sched+0xb5/0x2f0 >>>> [ 95.448653] __blk_mq_sched_dispatch_requests+0xf4/0x140 >>>> [ 95.453998] blk_mq_sched_dispatch_requests+0x30/0x60 >>>> [ 95.459083] __blk_mq_run_hw_queue+0x49/0x90 >>>> [ 95.463377] process_one_work+0x26c/0x570 >>>> [ 95.467421] worker_thread+0x55/0x3c0 >>>> [ 95.475313] kthread+0x140/0x160 >>>> [ 95.482774] ret_from_fork+0x1f/0x30 >>> >>> I don't see any function names in the above call stack that refer to the >>> mq-deadline scheduler? Did I perhaps overlook something? Anyway, if you can >>> tell me how to reproduce this (kernel commit + kernel config) I will take a >>> look. 
>> >> Indeed, the stack trace does not show any mq-deadline function. But the >> workqueue is stuck on _raw_spin_unlock_irqrestore() in the blk_mq_run_work_fn() >> function. I suspect that the spinlock is dd->lock, so the CPU may be stuck on >> entry to mq-deadline dispatch or finish request methods. Not entirely sure. > > I don't think you can be stuck on the *unlock* part, though. In my > experience, that function showing up in a soft lockup indicates you're > in a broken loop that's repeatedly locking and unlocking. I haven't > found anything immediately obvious in this call chain, though. Arg. I misread the stack trace. It is an unlock, not a lock... > >> I got this splat with 5.4.0-rc7 (Linus tag patch) with the attached config. > > Surely 5.14.0-rc7, right? Oops. Yes. >
On 2021/08/30 11:40, Bart Van Assche wrote: > On 8/29/21 16:02, Damien Le Moal wrote: >> On 2021/08/27 23:34, Bart Van Assche wrote: >>> On 8/26/21 9:49 PM, Damien Le Moal wrote: >>>> So the mq-deadline priority patch reduces performance by nearly half at high QD. >>>> (*) Note: in all cases using the mq-deadline scheduler, for the first run at >>>> QD=1, I get this splat 100% of the time. >>>> >>>> [ 95.173889] watchdog: BUG: soft lockup - CPU#0 stuck for 26s! [kworker/0:1H:757] >>>> [ 95.292994] CPU: 0 PID: 757 Comm: kworker/0:1H Not tainted 5.14.0-rc7+ #1334 >>>> [ 95.307504] Workqueue: kblockd blk_mq_run_work_fn >>>> [ 95.312243] RIP: 0010:_raw_spin_unlock_irqrestore+0x35/0x40 >>>> [ 95.415904] Call Trace: >>>> [ 95.418373] try_to_wake_up+0x268/0x7c0 >>>> [ 95.422238] blk_update_request+0x25b/0x420 >>>> [ 95.426452] blk_mq_end_request+0x1c/0x120 >>>> [ 95.430576] null_handle_cmd+0x12d/0x270 [null_blk] >>>> [ 95.435485] blk_mq_dispatch_rq_list+0x13c/0x7f0 >>>> [ 95.443826] __blk_mq_do_dispatch_sched+0xb5/0x2f0 >>>> [ 95.448653] __blk_mq_sched_dispatch_requests+0xf4/0x140 >>>> [ 95.453998] blk_mq_sched_dispatch_requests+0x30/0x60 >>>> [ 95.459083] __blk_mq_run_hw_queue+0x49/0x90 >>>> [ 95.463377] process_one_work+0x26c/0x570 >>>> [ 95.467421] worker_thread+0x55/0x3c0 >>>> [ 95.475313] kthread+0x140/0x160 >>>> [ 95.482774] ret_from_fork+0x1f/0x30 >>> >>> I don't see any function names in the above call stack that refer to the >>> mq-deadline scheduler? Did I perhaps overlook something? Anyway, if you can >>> tell me how to reproduce this (kernel commit + kernel config) I will take a >>> look. >> >> Indeed, the stack trace does not show any mq-deadline function. But the >> workqueue is stuck on _raw_spin_unlock_irqrestore() in the blk_mq_run_work_fn() >> function. I suspect that the spinlock is dd->lock, so the CPU may be stuck on >> entry to mq-deadline dispatch or finish request methods. Not entirely sure. 
>> >> I got this splat with 5.4.0-rc7 (Linus tag patch) with the attached config. > > Hi Damien, > > Thank you for having shared the kernel configuration used in your test. > So far I have not yet been able to reproduce the above call trace in a > VM. Could the above call trace have been triggered by the mpt3sas driver > instead of the mq-deadline I/O scheduler? The above was triggered using nullblk with the test script you sent. I was not using drives on the HBA or AHCI when it happens. And I can reproduce this 100% of the time by running your script with QD=1. > > Thanks, > > Bart. > >
On 8/29/21 8:07 PM, Damien Le Moal wrote: > On 2021/08/30 11:40, Bart Van Assche wrote: >> On 8/29/21 16:02, Damien Le Moal wrote: >>> On 2021/08/27 23:34, Bart Van Assche wrote: >>>> On 8/26/21 9:49 PM, Damien Le Moal wrote: >>>>> So the mq-deadline priority patch reduces performance by nearly half at high QD. >>>>> (*) Note: in all cases using the mq-deadline scheduler, for the first run at >>>>> QD=1, I get this splat 100% of the time. >>>>> >>>>> [ 95.173889] watchdog: BUG: soft lockup - CPU#0 stuck for 26s! [kworker/0:1H:757] >>>>> [ 95.292994] CPU: 0 PID: 757 Comm: kworker/0:1H Not tainted 5.14.0-rc7+ #1334 >>>>> [ 95.307504] Workqueue: kblockd blk_mq_run_work_fn >>>>> [ 95.312243] RIP: 0010:_raw_spin_unlock_irqrestore+0x35/0x40 >>>>> [ 95.415904] Call Trace: >>>>> [ 95.418373] try_to_wake_up+0x268/0x7c0 >>>>> [ 95.422238] blk_update_request+0x25b/0x420 >>>>> [ 95.426452] blk_mq_end_request+0x1c/0x120 >>>>> [ 95.430576] null_handle_cmd+0x12d/0x270 [null_blk] >>>>> [ 95.435485] blk_mq_dispatch_rq_list+0x13c/0x7f0 >>>>> [ 95.443826] __blk_mq_do_dispatch_sched+0xb5/0x2f0 >>>>> [ 95.448653] __blk_mq_sched_dispatch_requests+0xf4/0x140 >>>>> [ 95.453998] blk_mq_sched_dispatch_requests+0x30/0x60 >>>>> [ 95.459083] __blk_mq_run_hw_queue+0x49/0x90 >>>>> [ 95.463377] process_one_work+0x26c/0x570 >>>>> [ 95.467421] worker_thread+0x55/0x3c0 >>>>> [ 95.475313] kthread+0x140/0x160 >>>>> [ 95.482774] ret_from_fork+0x1f/0x30 >>>> >>>> I don't see any function names in the above call stack that refer to the >>>> mq-deadline scheduler? Did I perhaps overlook something? Anyway, if you can >>>> tell me how to reproduce this (kernel commit + kernel config) I will take a >>>> look. >>> >>> Indeed, the stack trace does not show any mq-deadline function. But the >>> workqueue is stuck on _raw_spin_unlock_irqrestore() in the blk_mq_run_work_fn() >>> function. 
I suspect that the spinlock is dd->lock, so the CPU may be stuck on >>> entry to mq-deadline dispatch or finish request methods. Not entirely sure. >>> >>> I got this splat with 5.4.0-rc7 (Linus tag patch) with the attached config. >> >> Hi Damien, >> >> Thank you for having shared the kernel configuration used in your test. >> So far I have not yet been able to reproduce the above call trace in a >> VM. Could the above call trace have been triggered by the mpt3sas driver >> instead of the mq-deadline I/O scheduler? > > The above was triggered using nullblk with the test script you sent. I was not > using drives on the HBA or AHCI when it happens. And I can reproduce this 100% > of the time by running your script with QD=1. Hi Damien, I rebuilt kernel v5.14-rc7 (tag v5.14-rc7) after having run git clean -f -d -x and reran my nullb iops test with the mq-deadline scheduler. No kernel complaints appeared in the kernel log. Next I enabled lockdep (CONFIG_PROVE_LOCKING=y) and reran the nullb iops test with mq-deadline as scheduler. Again zero complaints appeared in the kernel log. Next I ran a subset of the blktests test (./check -q block). All tests passed and no complaints appeared in the kernel log. Please help with root-causing this issue by rerunning the test on your setup after having enabled lockdep. Thanks, Bart.
On 2021/08/31 2:14, Bart Van Assche wrote: > On 8/29/21 8:07 PM, Damien Le Moal wrote: >> On 2021/08/30 11:40, Bart Van Assche wrote: >>> On 8/29/21 16:02, Damien Le Moal wrote: >>>> On 2021/08/27 23:34, Bart Van Assche wrote: >>>>> On 8/26/21 9:49 PM, Damien Le Moal wrote: >>>>>> So the mq-deadline priority patch reduces performance by nearly half at high QD. >>>>>> (*) Note: in all cases using the mq-deadline scheduler, for the first run at >>>>>> QD=1, I get this splat 100% of the time. >>>>>> >>>>>> [ 95.173889] watchdog: BUG: soft lockup - CPU#0 stuck for 26s! [kworker/0:1H:757] >>>>>> [ 95.292994] CPU: 0 PID: 757 Comm: kworker/0:1H Not tainted 5.14.0-rc7+ #1334 >>>>>> [ 95.307504] Workqueue: kblockd blk_mq_run_work_fn >>>>>> [ 95.312243] RIP: 0010:_raw_spin_unlock_irqrestore+0x35/0x40 >>>>>> [ 95.415904] Call Trace: >>>>>> [ 95.418373] try_to_wake_up+0x268/0x7c0 >>>>>> [ 95.422238] blk_update_request+0x25b/0x420 >>>>>> [ 95.426452] blk_mq_end_request+0x1c/0x120 >>>>>> [ 95.430576] null_handle_cmd+0x12d/0x270 [null_blk] >>>>>> [ 95.435485] blk_mq_dispatch_rq_list+0x13c/0x7f0 >>>>>> [ 95.443826] __blk_mq_do_dispatch_sched+0xb5/0x2f0 >>>>>> [ 95.448653] __blk_mq_sched_dispatch_requests+0xf4/0x140 >>>>>> [ 95.453998] blk_mq_sched_dispatch_requests+0x30/0x60 >>>>>> [ 95.459083] __blk_mq_run_hw_queue+0x49/0x90 >>>>>> [ 95.463377] process_one_work+0x26c/0x570 >>>>>> [ 95.467421] worker_thread+0x55/0x3c0 >>>>>> [ 95.475313] kthread+0x140/0x160 >>>>>> [ 95.482774] ret_from_fork+0x1f/0x30 >>>>> >>>>> I don't see any function names in the above call stack that refer to the >>>>> mq-deadline scheduler? Did I perhaps overlook something? Anyway, if you can >>>>> tell me how to reproduce this (kernel commit + kernel config) I will take a >>>>> look. >>>> >>>> Indeed, the stack trace does not show any mq-deadline function. But the >>>> workqueue is stuck on _raw_spin_unlock_irqrestore() in the blk_mq_run_work_fn() >>>> function. 
I suspect that the spinlock is dd->lock, so the CPU may be stuck on >>>> entry to mq-deadline dispatch or finish request methods. Not entirely sure. >>>> >>>> I got this splat with 5.4.0-rc7 (Linus tag patch) with the attached config. >>> >>> Hi Damien, >>> >>> Thank you for having shared the kernel configuration used in your test. >>> So far I have not yet been able to reproduce the above call trace in a >>> VM. Could the above call trace have been triggered by the mpt3sas driver >>> instead of the mq-deadline I/O scheduler? >> >> The above was triggered using nullblk with the test script you sent. I was not >> using drives on the HBA or AHCI when it happens. And I can reproduce this 100% >> of the time by running your script with QD=1. > > Hi Damien, > > I rebuilt kernel v5.14-rc7 (tag v5.14-rc7) after having run git clean -f -d -x > and reran my nullb iops test with the mq-deadline scheduler. No kernel complaints > appeared in the kernel log. Next I enabled lockdep (CONFIG_PROVE_LOCKING=y) and > reran the nullb iops test with mq-deadline as scheduler. Again zero complaints > appeared in the kernel log. Next I ran a subset of the blktests test > (./check -q block). All tests passed and no complaints appeared in the kernel log. > > Please help with root-causing this issue by rerunning the test on your setup after > having enabled lockdep. OK. Will have a look again. > > Thanks, > > Bart. >
diff --git a/block/mq-deadline.c b/block/mq-deadline.c index a09761cbdf12e58..d8f6aa12de80049 100644 --- a/block/mq-deadline.c +++ b/block/mq-deadline.c @@ -79,6 +79,7 @@ struct dd_per_prio { struct list_head fifo_list[DD_DIR_COUNT]; /* Next request in FIFO order. Read, write or both are NULL. */ struct request *next_rq[DD_DIR_COUNT]; + unsigned int nr_queued; }; struct deadline_data { @@ -277,9 +278,9 @@ deadline_move_request(struct deadline_data *dd, struct dd_per_prio *per_prio, } /* Number of requests queued for a given priority level. */ -static u32 dd_queued(struct deadline_data *dd, enum dd_prio prio) +static __always_inline u32 dd_queued(struct deadline_data *dd, enum dd_prio prio) { - return dd_sum(dd, inserted, prio) - dd_sum(dd, completed, prio); + return dd->per_prio[prio].nr_queued; } /* @@ -711,6 +712,8 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq, prio = ioprio_class_to_prio[ioprio_class]; dd_count(dd, inserted, prio); + per_prio = &dd->per_prio[prio]; + per_prio->nr_queued++; if (blk_mq_sched_try_insert_merge(q, rq, &free)) { blk_mq_free_requests(&free); @@ -719,7 +722,6 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq, trace_block_rq_insert(rq); - per_prio = &dd->per_prio[prio]; if (at_head) { list_add(&rq->queuelist, &per_prio->dispatch); } else { @@ -790,12 +792,14 @@ static void dd_finish_request(struct request *rq) const u8 ioprio_class = dd_rq_ioclass(rq); const enum dd_prio prio = ioprio_class_to_prio[ioprio_class]; struct dd_per_prio *per_prio = &dd->per_prio[prio]; + unsigned long flags; dd_count(dd, completed, prio); + spin_lock_irqsave(&dd->lock, flags); + per_prio->nr_queued--; + spin_unlock_irqrestore(&dd->lock, flags); if (blk_queue_is_zoned(q)) { - unsigned long flags; - spin_lock_irqsave(&dd->zone_lock, flags); blk_req_zone_write_unlock(rq); if (!list_empty(&per_prio->fifo_list[DD_WRITE]))
dd_queued() traverses the percpu variable for summation. The more cores, the higher the performance overhead. I currently have a 128-core board and this function takes 2.5 us. If the number of high-priority requests is small and the number of low- and medium-priority requests is large, the performance impact is significant. Let's maintain a non-percpu member variable 'nr_queued', which is incremented by 1 immediately following "inserted++" and decremented by 1 immediately following "completed++". Because both the judgment dd_queued() in dd_dispatch_request() and operation "inserted++" in dd_insert_request() are protected by dd->lock, lock protection needs to be added only in dd_finish_request(), which is unlikely to cause significant performance side effects. Tested on my 128-core board with two ssd disks. fio bs=4k rw=read iodepth=128 cpus_allowed=0-95 <others> Before: [183K/0/0 iops] [172K/0/0 iops] After: [258K/0/0 iops] [258K/0/0 iops] Fixes: fb926032b320 ("block/mq-deadline: Prioritize high-priority requests") Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> --- block/mq-deadline.c | 14 +++++++++----- 1 file changed, 9 insertions(+), 5 deletions(-)
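The trade-off the commit message describes — dd_queued() walking one counter pair per CPU versus reading a single counter maintained under dd->lock — can be illustrated with a userspace sketch. NR_CPUS and the struct layout are illustrative, not the kernel's per-CPU machinery:

```c
#include <assert.h>

#define NR_CPUS 128

struct percpu_stats {
    unsigned int inserted;
    unsigned int completed;
};

static struct percpu_stats stats[NR_CPUS];
static unsigned int nr_queued;  /* single counter, kept under dd->lock */

/* Old scheme: an O(NR_CPUS) walk on every dispatch decision — the
 * ~2.5 us the commit message measured on a 128-core board. */
static unsigned int dd_queued_percpu(void)
{
    unsigned int sum = 0;

    for (int cpu = 0; cpu < NR_CPUS; cpu++)
        sum += stats[cpu].inserted - stats[cpu].completed;
    return sum;
}

/* New scheme: an O(1) read; correct because insert and the dispatch-side
 * read are serialized by dd->lock in the real code. */
static unsigned int dd_queued_fast(void)
{
    return nr_queued;
}

static void insert_on_cpu(int cpu)
{
    stats[cpu].inserted++;
    nr_queued++;
}

static void complete_on_cpu(int cpu)
{
    stats[cpu].completed++;
    nr_queued--;
}
```

Both schemes return the same value; the patch trades the per-CPU walk for one shared counter whose decrement needs dd->lock in dd_finish_request().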