block/mq-deadline: Speed up the dispatch of low-priority requests

Message ID 20210826144039.2143-1-thunder.leizhen@huawei.com (mailing list archive)
State New, archived
Series block/mq-deadline: Speed up the dispatch of low-priority requests

Commit Message

Zhen Lei Aug. 26, 2021, 2:40 p.m. UTC
dd_queued() traverses the percpu variable for summation. The more cores,
the higher the performance overhead. I currently have a 128-core board and
this function takes 2.5 us. If the number of high-priority requests is
small and the number of low- and medium-priority requests is large, the
performance impact is significant.

Let's maintain a non-percpu member variable 'nr_queued', which is
incremented by 1 immediately following "inserted++" and decremented by 1
immediately following "completed++". Because both the judgment dd_queued()
in dd_dispatch_request() and operation "inserted++" in dd_insert_request()
are protected by dd->lock, lock protection needs to be added only in
dd_finish_request(), which is unlikely to cause significant performance
side effects.

Tested on my 128-core board with two ssd disks.
fio bs=4k rw=read iodepth=128 cpus_allowed=0-95 <others>
Before:
[183K/0/0 iops]
[172K/0/0 iops]

After:
[258K/0/0 iops]
[258K/0/0 iops]

Fixes: fb926032b320 ("block/mq-deadline: Prioritize high-priority requests")
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
---
 block/mq-deadline.c | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

Comments

Bart Van Assche Aug. 26, 2021, 6:09 p.m. UTC | #1
On 8/26/21 7:40 AM, Zhen Lei wrote:
> lock protection needs to be added only in
> dd_finish_request(), which is unlikely to cause significant performance
> side effects.

Not sure the above is correct. Every new atomic instruction has a measurable
performance overhead. But I guess in this case that overhead is smaller than
the time needed to sum 128 per-CPU variables.
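For illustration only, here is a minimal, self-contained C sketch of the trade-off
under discussion (not the kernel's actual dd_sum()/dd_count() helpers; NR_CPUS and
the counter names are made up for the example). Reading a per-CPU-style counter
costs a walk over all CPUs, while a single shared counter is read in O(1) but pays
a synchronization cost on every update:

#include <stdatomic.h>
#include <stdio.h>

#define NR_CPUS 128	/* hypothetical core count */

/* Per-CPU style: each CPU bumps its own slot; a read must sum every slot. */
static long pcpu_inserted[NR_CPUS];
static long pcpu_completed[NR_CPUS];

static long queued_from_percpu(void)
{
	long sum = 0;
	int cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++)	/* O(NR_CPUS) per read */
		sum += pcpu_inserted[cpu] - pcpu_completed[cpu];
	return sum;
}

/* Shared-counter style: O(1) to read, but every update touches one shared line. */
static atomic_long shared_queued;

static long queued_from_shared(void)
{
	return atomic_load(&shared_queued);	/* O(1) per read */
}

int main(void)
{
	pcpu_inserted[3] = 5;
	pcpu_completed[3] = 2;
	atomic_store(&shared_queued, 3);

	printf("percpu view: %ld, shared view: %ld\n",
	       queued_from_percpu(), queued_from_shared());
	return 0;
}

In the mq-deadline case the summing variant runs under dd->lock in the dispatch
path, which is why the read cost grows with the core count.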

> Tested on my 128-core board with two ssd disks.
> fio bs=4k rw=read iodepth=128 cpus_allowed=0-95 <others>
> Before:
> [183K/0/0 iops]
> [172K/0/0 iops]
> 
> After:
> [258K/0/0 iops]
> [258K/0/0 iops]

Nice work!

> Fixes: fb926032b320 ("block/mq-deadline: Prioritize high-priority requests")

Shouldn't the Fixes: tag be used only for patches that modify functionality?
I'm not sure it is appropriate to use this tag for performance improvements.

>  struct deadline_data {
> @@ -277,9 +278,9 @@ deadline_move_request(struct deadline_data *dd, struct dd_per_prio *per_prio,
>  }
>  
>  /* Number of requests queued for a given priority level. */
> -static u32 dd_queued(struct deadline_data *dd, enum dd_prio prio)
> +static __always_inline u32 dd_queued(struct deadline_data *dd, enum dd_prio prio)
>  {
> -	return dd_sum(dd, inserted, prio) - dd_sum(dd, completed, prio);
> +	return dd->per_prio[prio].nr_queued;
>  }

Please leave out "__always_inline". Modern compilers are smart enough to
inline this function without using the "inline" keyword.

> @@ -711,6 +712,8 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
>  
>  	prio = ioprio_class_to_prio[ioprio_class];
>  	dd_count(dd, inserted, prio);
> +	per_prio = &dd->per_prio[prio];
> +	per_prio->nr_queued++;
>  
>  	if (blk_mq_sched_try_insert_merge(q, rq, &free)) {
>  		blk_mq_free_requests(&free);

I think the above is wrong - nr_queued should not be incremented if the
request is merged into another request. Please move the code that increments
nr_queued past the above if-statement.
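A sketch of the reordering being requested, based only on the hunk quoted above
(function and field names as in the quoted diff; this is not the final patch, and
as discussed further down-thread dd_count() itself can stay where it is because a
merged request still reaches dd_finish_request()):

	prio = ioprio_class_to_prio[ioprio_class];
	dd_count(dd, inserted, prio);
	per_prio = &dd->per_prio[prio];

	if (blk_mq_sched_try_insert_merge(q, rq, &free)) {
		blk_mq_free_requests(&free);
		return;
	}

	/* Only requests that are actually queued are counted in nr_queued. */
	per_prio->nr_queued++;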

Thanks,

Bart.
Jens Axboe Aug. 26, 2021, 6:13 p.m. UTC | #2
On 8/26/21 12:09 PM, Bart Van Assche wrote:
> On 8/26/21 7:40 AM, Zhen Lei wrote:
>> lock protection needs to be added only in dd_finish_request(), which
>> is unlikely to cause significant performance side effects.
> 
> Not sure the above is correct. Every new atomic instruction has a
> measurable performance overhead. But I guess in this case that
> overhead is smaller than the time needed to sum 128 per-CPU variables.

percpu counters only really work if the summing is not in a hot path,
or if the summing is just some "not zero" thing instead of a full sum.
They just don't scale at all for even moderately sized systems.

>> Tested on my 128-core board with two ssd disks.
>> fio bs=4k rw=read iodepth=128 cpus_allowed=0-95 <others>
>> Before:
>> [183K/0/0 iops]
>> [172K/0/0 iops]
>>
>> After:
>> [258K/0/0 iops]
>> [258K/0/0 iops]
> 
> Nice work!
> 
>> Fixes: fb926032b320 ("block/mq-deadline: Prioritize high-priority requests")
> 
> Shouldn't the Fixes: tag be used only for patches that modify
> functionality? I'm not sure it is appropriate to use this tag for
> performance improvements.

For a regression this big, I think it's the right thing. Anyone that may
backport the original commit definitely should also get the followup
fix. This isn't just a performance improvement, it's fixing a big
performance regression.
Jens Axboe Aug. 26, 2021, 6:45 p.m. UTC | #3
On 8/26/21 12:13 PM, Jens Axboe wrote:
> On 8/26/21 12:09 PM, Bart Van Assche wrote:
>> On 8/26/21 7:40 AM, Zhen Lei wrote:
>>> lock protection needs to be added only in dd_finish_request(), which
>>> is unlikely to cause significant performance side effects.
>>
>> Not sure the above is correct. Every new atomic instruction has a
>> measurable performance overhead. But I guess in this case that
>> overhead is smaller than the time needed to sum 128 per-CPU variables.
> 
> percpu counters only really work if the summing is not in a hot path,
> or if the summing is just some "not zero" thing instead of a full sum.
> They just don't scale at all for even moderately sized systems.

Ugh it's actually even worse in this case, since you do:

static u32 dd_queued(struct deadline_data *dd, enum dd_prio prio)
{
	return dd_sum(dd, inserted, prio) - dd_sum(dd, completed, prio);
}

which ends up iterating possible CPUs _twice_!

Just ran a quick test here, and I go from 3.55M IOPS to 1.23M switching
to deadline, of which 37% of the overhead is from dd_dispatch().

With the posted patch applied, it runs at 2.3M IOPS with mq-deadline,
which is a lot better. This is on my 3970X test box, so 32 cores, 64
threads.

Bart, either we fix this up ASAP and get rid of the percpu counters in
the hot path, or we revert this patch.
Bart Van Assche Aug. 26, 2021, 7:17 p.m. UTC | #4
On 8/26/21 11:45 AM, Jens Axboe wrote:
> Bart, either we fix this up ASAP and get rid of the percpu counters in
> the hot path, or we revert this patch.

I'm at home taking care of my son because he is sick.

Do you want to wait for Zhen to post a fixed version of his patch or do you
want me to fix his patch?

Bart.
Jens Axboe Aug. 26, 2021, 7:32 p.m. UTC | #5
On 8/26/21 1:17 PM, Bart Van Assche wrote:
> On 8/26/21 11:45 AM, Jens Axboe wrote:
>> Bart, either we fix this up ASAP and get rid of the percpu counters in
>> the hot path, or we revert this patch.
> 
> I'm at home taking care of my son because he is sick.

Take care of that, always more important than the kernel :-)

> Do you want to wait for Zhen to post a fixed version of his patch or do you
> want me to fix his patch?

To be a bit proactive, I queued up a revert. Messing with it at this
point is making me a bit nervous. If the final patch ends up looking
clean enough, then we can still go that route. But it really has to be
unequivocally done this late in the game...
Bart Van Assche Aug. 26, 2021, 11:49 p.m. UTC | #6
On 8/26/21 11:45 AM, Jens Axboe wrote:
> Just ran a quick test here, and I go from 3.55M IOPS to 1.23M switching
> to deadline, of which 37% of the overhead is from dd_dispatch().
> 
> With the posted patch applied, it runs at 2.3M IOPS with mq-deadline,
> which is a lot better. This is on my 3970X test box, so 32 cores, 64
> threads.

Hi Jens,

With the script below, queue depth >= 2 and an improved version of
Zhen's patch I see 970 K IOPS with the mq-deadline scheduler in an
8 core VM (i7-4790 CPU). In other words, more IOPS than what Zhen
reported with fewer CPU cores. Is that good enough?

Thanks,

Bart.

#!/bin/bash

if [ -e /sys/kernel/config/nullb ]; then
    for d in /sys/kernel/config/nullb/*; do
        [ -d "$d" ] && rmdir "$d"
    done
fi
numcpus=$(grep -c ^processor /proc/cpuinfo)
modprobe -r null_blk
[ -e /sys/module/null_blk ] && exit $?
modprobe null_blk nr_devices=0 &&
    udevadm settle &&
    cd /sys/kernel/config/nullb &&
    mkdir nullb0 &&
    cd nullb0 &&
    echo 0 > completion_nsec &&
    echo 512 > blocksize &&
    echo 0 > home_node &&
    echo 0 > irqmode &&
    echo 1024 > size &&
    echo 0 > memory_backed &&
    echo 2 > queue_mode &&
    echo 1 > power ||
    exit $?

(
    cd /sys/block/nullb0/queue &&
    echo 2 > rq_affinity
) || exit $?

iodepth=${1:-1}
runtime=30
args=()
if [ "$iodepth" = 1 ]; then
	args+=(--ioengine=psync)
else
	args+=(--ioengine=io_uring --iodepth_batch=$((iodepth/2)))
fi
args+=(--iodepth=$iodepth --name=nullb0 --filename=/dev/nullb0 \
    --rw=read --bs=512 --loops=$((1<<20)) --direct=1 --numjobs=$numcpus \
    --thread --runtime=$runtime --invalidate=1 --gtod_reduce=1 \
    --group_reporting=1 --ioscheduler=mq-deadline)
if numactl -m 0 -N 0 echo >&/dev/null; then
	numactl -m 0 -N 0 -- fio "${args[@]}"
else
	fio "${args[@]}"
fi
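For reference, the script takes the fio queue depth as its only argument
(iodepth=${1:-1} above): run with no argument it uses the psync engine at queue
depth 1, while e.g. passing 8 selects io_uring at queue depth 8 with
iodepth_batch=4.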
Jens Axboe Aug. 26, 2021, 11:51 p.m. UTC | #7
On 8/26/21 5:49 PM, Bart Van Assche wrote:
> On 8/26/21 11:45 AM, Jens Axboe wrote:
>> Just ran a quick test here, and I go from 3.55M IOPS to 1.23M switching
>> to deadline, of which 37% of the overhead is from dd_dispatch().
>>
>> With the posted patch applied, it runs at 2.3M IOPS with mq-deadline,
>> which is a lot better. This is on my 3970X test box, so 32 cores, 64
>> threads.
> 
> Hi Jens,
> 
> With the script below, queue depth >= 2 and an improved version of
> Zhen's patch I see 970 K IOPS with the mq-deadline scheduler in an
> 8 core VM (i7-4790 CPU). In other words, more IOPS than what Zhen
> reported with fewer CPU cores. Is that good enough?

That depends, what kind of IOPS are you getting if you revert the
original change?
Bart Van Assche Aug. 27, 2021, 12:03 a.m. UTC | #8
On 8/26/21 4:51 PM, Jens Axboe wrote:
> On 8/26/21 5:49 PM, Bart Van Assche wrote:
>> On 8/26/21 11:45 AM, Jens Axboe wrote:
>>> Just ran a quick test here, and I go from 3.55M IOPS to 1.23M switching
>>> to deadline, of which 37% of the overhead is from dd_dispatch().
>>>
>>> With the posted patch applied, it runs at 2.3M IOPS with mq-deadline,
>>> which is a lot better. This is on my 3970X test box, so 32 cores, 64
>>> threads.
>>
>> Hi Jens,
>>
>> With the script below, queue depth >= 2 and an improved version of
>> Zhen's patch I see 970 K IOPS with the mq-deadline scheduler in an
>> 8 core VM (i7-4790 CPU). In other words, more IOPS than what Zhen
>> reported with fewer CPU cores. Is that good enough?
> 
> That depends, what kind of IOPS are you getting if you revert the
> original change?

Hi Jens,

Here is an overview of the tests I ran so far, all on the same test
setup:
* No I/O scheduler:               about 5630 K IOPS.
* Kernel v5.11 + mq-deadline:     about 1100 K IOPS.
* block-for-next + mq-deadline:   about  760 K IOPS.
* block-for-next with improved mq-deadline performance: about 970 K IOPS.

Bart.
Jens Axboe Aug. 27, 2021, 12:05 a.m. UTC | #9
On 8/26/21 6:03 PM, Bart Van Assche wrote:
> On 8/26/21 4:51 PM, Jens Axboe wrote:
>> On 8/26/21 5:49 PM, Bart Van Assche wrote:
>>> On 8/26/21 11:45 AM, Jens Axboe wrote:
>>>> Just ran a quick test here, and I go from 3.55M IOPS to 1.23M switching
>>>> to deadline, of which 37% of the overhead is from dd_dispatch().
>>>>
>>>> With the posted patch applied, it runs at 2.3M IOPS with mq-deadline,
>>>> which is a lot better. This is on my 3970X test box, so 32 cores, 64
>>>> threads.
>>>
>>> Hi Jens,
>>>
>>> With the script below, queue depth >= 2 and an improved version of
>>> Zhen's patch I see 970 K IOPS with the mq-deadline scheduler in an
>>> 8 core VM (i7-4790 CPU). In other words, more IOPS than what Zhen
>>> reported with fewer CPU cores. Is that good enough?
>>
>> That depends, what kind of IOPS are you getting if you revert the
>> original change?
> 
> Hi Jens,
> 
> Here is an overview of the tests I ran so far, all on the same test
> setup:
> * No I/O scheduler:               about 5630 K IOPS.
> * Kernel v5.11 + mq-deadline:     about 1100 K IOPS.
> * block-for-next + mq-deadline:   about  760 K IOPS.
> * block-for-next with improved mq-deadline performance: about 970 K IOPS.

So we're still off by about 12%, I don't think that is good enough.
That's assuming that v5.11 + mq-deadline is the same as for-next with
the mq-deadline change reverted? Because that would be the key number to
compare it with.
Bart Van Assche Aug. 27, 2021, 12:58 a.m. UTC | #10
On 8/26/21 5:05 PM, Jens Axboe wrote:
>> Here is an overview of the tests I ran so far, all on the same test
>> setup:
>> * No I/O scheduler:               about 5630 K IOPS.
>> * Kernel v5.11 + mq-deadline:     about 1100 K IOPS.
>> * block-for-next + mq-deadline:   about  760 K IOPS.
>> * block-for-next with improved mq-deadline performance: about 970 K IOPS.
> 
> So we're still off by about 12%, I don't think that is good enough.
> That's assuming that v5.11 + mq-deadline is the same as for-next with
> the mq-deadline change reverted? Because that would be the key number to
> compare it with.

A quick attempt to eliminate the ktime_get_ns() call from
__dd_dispatch_request() improved performance by a few percent, but not as much
as I was hoping. I need a few days to run these measurements, optimize
performance further and rerun all functional and performance tests.

Thanks,

Bart.
Damien Le Moal Aug. 27, 2021, 2:30 a.m. UTC | #11
On Thu, 2021-08-26 at 22:40 +0800, Zhen Lei wrote:
> dd_queued() traverses the percpu variable for summation. The more cores,
> the higher the performance overhead. I currently have a 128-core board and
> this function takes 2.5 us. If the number of high-priority requests is
> small and the number of low- and medium-priority requests is large, the
> performance impact is significant.
> 
> Let's maintain a non-percpu member variable 'nr_queued', which is
> incremented by 1 immediately following "inserted++" and decremented by 1
> immediately following "completed++". Because both the judgment dd_queued()
> in dd_dispatch_request() and operation "inserted++" in dd_insert_request()
> are protected by dd->lock, lock protection needs to be added only in
> dd_finish_request(), which is unlikely to cause significant performance
> side effects.
> 
> Tested on my 128-core board with two ssd disks.
> fio bs=4k rw=read iodepth=128 cpus_allowed=0-95 <others>
> Before:
> [183K/0/0 iops]
> [172K/0/0 iops]
> 
> After:
> [258K/0/0 iops]
> [258K/0/0 iops]
> 
> Fixes: fb926032b320 ("block/mq-deadline: Prioritize high-priority requests")
> Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
> ---
>  block/mq-deadline.c | 14 +++++++++-----
>  1 file changed, 9 insertions(+), 5 deletions(-)
> 
> diff --git a/block/mq-deadline.c b/block/mq-deadline.c
> index a09761cbdf12e58..d8f6aa12de80049 100644
> --- a/block/mq-deadline.c
> +++ b/block/mq-deadline.c
> @@ -79,6 +79,7 @@ struct dd_per_prio {
>  	struct list_head fifo_list[DD_DIR_COUNT];
>  	/* Next request in FIFO order. Read, write or both are NULL. */
>  	struct request *next_rq[DD_DIR_COUNT];
> +	unsigned int nr_queued;
>  };
>  
>  struct deadline_data {
> @@ -277,9 +278,9 @@ deadline_move_request(struct deadline_data *dd, struct dd_per_prio *per_prio,
>  }
>  
>  /* Number of requests queued for a given priority level. */
> -static u32 dd_queued(struct deadline_data *dd, enum dd_prio prio)
> +static __always_inline u32 dd_queued(struct deadline_data *dd, enum dd_prio prio)
>  {
> -	return dd_sum(dd, inserted, prio) - dd_sum(dd, completed, prio);
> +	return dd->per_prio[prio].nr_queued;
>  }
>  
>  /*
> @@ -711,6 +712,8 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
>  
>  	prio = ioprio_class_to_prio[ioprio_class];
>  	dd_count(dd, inserted, prio);
> +	per_prio = &dd->per_prio[prio];
> +	per_prio->nr_queued++;
>  
>  	if (blk_mq_sched_try_insert_merge(q, rq, &free)) {
>  		blk_mq_free_requests(&free);
> @@ -719,7 +722,6 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
>  
>  	trace_block_rq_insert(rq);
>  
> -	per_prio = &dd->per_prio[prio];
>  	if (at_head) {
>  		list_add(&rq->queuelist, &per_prio->dispatch);
>  	} else {
> @@ -790,12 +792,14 @@ static void dd_finish_request(struct request *rq)
>  	const u8 ioprio_class = dd_rq_ioclass(rq);
>  	const enum dd_prio prio = ioprio_class_to_prio[ioprio_class];
>  	struct dd_per_prio *per_prio = &dd->per_prio[prio];
> +	unsigned long flags;
>  
>  	dd_count(dd, completed, prio);
> +	spin_lock_irqsave(&dd->lock, flags);
> +	per_prio->nr_queued--;
> +	spin_unlock_irqrestore(&dd->lock, flags);

dd->lock is not taken with irqsave everywhere else. This leads to hard lockups
which I hit right away on boot. To avoid this, we need a spin_lock_irqsave()
everywhere.
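What "spin_lock_irqsave() everywhere" would look like for dd->lock, as a sketch
only (not a tested patch): dd_finish_request() can run from interrupt context on
request completion, so any other dd->lock taker that leaves interrupts enabled can
be interrupted while holding the lock and then deadlock against the completion
path.

	unsigned long flags;

	spin_lock_irqsave(&dd->lock, flags);	/* was: spin_lock(&dd->lock) */
	/* ... dispatch/insert work done under dd->lock ... */
	spin_unlock_irqrestore(&dd->lock, flags);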

Of note is that without this patch, testing on nullblk with Bart's script on
5.14.0-rc7, I get this splat:

 [  198.726920] watchdog: BUG: soft lockup - CPU#20 stuck for 26s!
[kworker/20:1H:260]
[  198.734550] Modules linked in: null_blk rpcsec_gss_krb5 auth_rpcgss nfsv4
dns_resolver nfs lockd grace fscache netfs nft_fib_inet nft_fib_ipv4
nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject
nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set
nf_tables libcrc32c nfnetlink sunrpc vfat fat iTCO_wdt iTCO_vendor_support
ipmi_ssif x86_pkg_temp_thermal acpi_ipmi coretemp ipmi_si ioatdma i2c_i801 bfq
i2c_smbus lpc_ich intel_pch_thermal dca ipmi_devintf ipmi_msghandler
acpi_power_meter fuse ip_tables sd_mod ast i2c_algo_bit drm_vram_helper
drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm_ttm_helper ttm
drm i40e crct10dif_pclmul mpt3sas crc32_pclmul ahci ghash_clmulni_intel libahci
libata raid_class scsi_transport_sas pkcs8_key_parser
[  198.805375] irq event stamp: 25378690
[  198.809063] hardirqs last  enabled at (25378689): [<ffffffff81149959>]
ktime_get+0x109/0x120
[  198.817545] hardirqs last disabled at (25378690): [<ffffffff8190519b>]
sysvec_apic_timer_interrupt+0xb/0x90
[  198.827327] softirqs last  enabled at (25337302): [<ffffffff810b331f>]
__irq_exit_rcu+0xbf/0xe0
[  198.836066] softirqs last disabled at (25337297): [<ffffffff810b331f>]
__irq_exit_rcu+0xbf/0xe0
[  198.844802] CPU: 20 PID: 260 Comm: kworker/20:1H Not tainted 5.14.0-rc7+
#1324
[  198.852059] Hardware name: Supermicro Super Server/X11DPL-i, BIOS 3.3
02/21/2020
[  198.859487] Workqueue: kblockd blk_mq_run_work_fn
[  198.864222] RIP: 0010:__list_add_valid+0x33/0x40
[  198.868868] Code: f2 0f 85 ec 44 44 00 4c 8b 0a 4d 39 c1 0f 85 08 45 44 00
48 39 d7 0f 84 e8 44 44 00 4c 39 cf 0f 84 df 44 44 00 b8 01 00 00 00 <c3> 66 66
2e 0f 1f 84 00 00 00 00 00 90 48 8b 17 4c 8b 47 08 48 b8
[  198.887712] RSP: 0018:ffff8883f1337d68 EFLAGS: 00000206
[  198.892963] RAX: 0000000000000001 RBX: ffff88857dae0840 RCX:
0000000000000000
[  198.900132] RDX: ffff8885387e2bc8 RSI: ffff8885387e2bc8 RDI:
ffff8885387e2d48
[  198.907300] RBP: ffff8883f1337d90 R08: ffff8883f1337d90 R09:
ffff8883f1337d90
[  198.914467] R10: 0000000000000020 R11: 0000000000000001 R12:
000000000000000a
[  198.921632] R13: ffff88857dae0800 R14: ffff8885bd3f3400 R15:
ffff8885bd276200
[  198.928801] FS:  0000000000000000(0000) GS:ffff888860100000(0000)
knlGS:0000000000000000
[  198.936929] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  198.942703] CR2: 0000000002204440 CR3: 0000000107322004 CR4:
00000000007706e0
[  198.949871] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[  198.957036] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
0000000000000400
[  198.964203] PKRU: 55555554
[  198.966933] Call Trace:
[  198.969401]  __blk_mq_do_dispatch_sched+0x234/0x2f0
[  198.974314]  __blk_mq_sched_dispatch_requests+0xf4/0x140
[  198.979662]  blk_mq_sched_dispatch_requests+0x30/0x60
[  198.984744]  __blk_mq_run_hw_queue+0x49/0x90
[  198.989041]  process_one_work+0x26c/0x570
[  198.993083]  worker_thread+0x55/0x3c0
[  198.996776]  ? process_one_work+0x570/0x570
[  199.000993]  kthread+0x140/0x160
[  199.004243]  ? set_kthread_struct+0x40/0x40
[  199.008452]  ret_from_fork+0x1f/0x30




>  
>  	if (blk_queue_is_zoned(q)) {
> -		unsigned long flags;
> -
>  		spin_lock_irqsave(&dd->zone_lock, flags);
>  		blk_req_zone_write_unlock(rq);
>  		if (!list_empty(&per_prio->fifo_list[DD_WRITE]))
Bart Van Assche Aug. 27, 2021, 2:48 a.m. UTC | #12
On 8/26/21 5:05 PM, Jens Axboe wrote:
> On 8/26/21 6:03 PM, Bart Van Assche wrote:
>> Here is an overview of the tests I ran so far, all on the same test
>> setup:
>> * No I/O scheduler:               about 5630 K IOPS.
>> * Kernel v5.11 + mq-deadline:     about 1100 K IOPS.
>> * block-for-next + mq-deadline:   about  760 K IOPS.
>> * block-for-next with improved mq-deadline performance: about 970 K IOPS.
> 
> So we're still off by about 12%, I don't think that is good enough.
> That's assuming that v5.11 + mq-deadline is the same as for-next with
> the mq-deadline change reverted? Because that would be the key number to
> compare it with.

With the patch series that is available at
https://github.com/bvanassche/linux/tree/block-for-next the same test reports
1090 K IOPS or only 1% below the v5.11 result. I will post that series on the
linux-block mailing list after I have finished testing that series.

Bart.
Jens Axboe Aug. 27, 2021, 3:13 a.m. UTC | #13
On 8/26/21 8:48 PM, Bart Van Assche wrote:
> On 8/26/21 5:05 PM, Jens Axboe wrote:
>> On 8/26/21 6:03 PM, Bart Van Assche wrote:
>>> Here is an overview of the tests I ran so far, all on the same test
>>> setup:
>>> * No I/O scheduler:               about 5630 K IOPS.
>>> * Kernel v5.11 + mq-deadline:     about 1100 K IOPS.
>>> * block-for-next + mq-deadline:   about  760 K IOPS.
>>> * block-for-next with improved mq-deadline performance: about 970 K IOPS.
>>
>> So we're still off by about 12%, I don't think that is good enough.
>> That's assuming that v5.11 + mq-deadline is the same as for-next with
>> the mq-deadline change reverted? Because that would be the key number to
>> compare it with.
> 
> With the patch series that is available at
> https://github.com/bvanassche/linux/tree/block-for-next the same test reports
> 1090 K IOPS or only 1% below the v5.11 result. I will post that series on the
> linux-block mailing list after I have finished testing that series.

OK sounds good. I do think we should just do the revert at this point,
any real fix is going to end up being bigger than I'd like at this
point. Then we can re-introduce the feature once we're happy with the
results.
Damien Le Moal Aug. 27, 2021, 4:49 a.m. UTC | #14
On 2021/08/27 12:13, Jens Axboe wrote:
> On 8/26/21 8:48 PM, Bart Van Assche wrote:
>> On 8/26/21 5:05 PM, Jens Axboe wrote:
>>> On 8/26/21 6:03 PM, Bart Van Assche wrote:
>>>> Here is an overview of the tests I ran so far, all on the same test
>>>> setup:
>>>> * No I/O scheduler:               about 5630 K IOPS.
>>>> * Kernel v5.11 + mq-deadline:     about 1100 K IOPS.
>>>> * block-for-next + mq-deadline:   about  760 K IOPS.
>>>> * block-for-next with improved mq-deadline performance: about 970 K IOPS.
>>>
>>> So we're still off by about 12%, I don't think that is good enough.
>>> That's assuming that v5.11 + mq-deadline is the same as for-next with
>>> the mq-deadline change reverted? Because that would be the key number to
>>> compare it with.
>>
>> With the patch series that is available at
>> https://github.com/bvanassche/linux/tree/block-for-next the same test reports
>> 1090 K IOPS or only 1% below the v5.11 result. I will post that series on the
>> linux-block mailing list after I have finished testing that series.
> 
> OK sounds good. I do think we should just do the revert at this point,
> any real fix is going to end up being bigger than I'd like at this
> point. Then we can re-introduce the feature once we're happy with the
> results.

FYI, here is what I get with Bart's test script running on a dual socket
8-cores/16-threads Xeon machine (32 CPUs total):

* 5.14.0-rc7, with fb926032b320 reverted:
-----------------------------------------

QD 1: IOPS=305k (*)
QD 2: IOPS=411k
QD 4: IOPS=408k
QD 8: IOPS=414k

* 5.14.0-rc7, current (no modification):
----------------------------------------

QD 1: IOPS=296k (*)
QD 2: IOPS=207k
QD 4: IOPS=208k
QD 8: IOPS=210k

* 5.14.0-rc7, with modified patch (attached to this email):
-----------------------------------------------------------

QD 1: IOPS=287k (*)
QD 2: IOPS=334k
QD 4: IOPS=330k
QD 8: IOPS=334k

For reference, with the same test script using the none scheduler:

QD 1: IOPS=2172K
QD 2: IOPS=1075K
QD 4: IOPS=1075k
QD 8: IOPS=1077k

So the mq-deadline priority patch reduces performance by nearly half at high QD.
With the modified patch, we are back to better numbers, but still a significant
20% drop at high QD.

(*) Note: in all cases using the mq-deadline scheduler, for the first run at
QD=1, I get this splat 100% of the time.

[   95.173889] watchdog: BUG: soft lockup - CPU#0 stuck for 26s! [kworker/0:1H:757]
[   95.181351] Modules linked in: null_blk rpcsec_gss_krb5 auth_rpcgss nfsv4
dns_resolver nfs lockd grace fscache netfs nft_fib_inet nft_fib_ipv4
nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject
nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set
nf_tables libcrc32c nfnetlink sunrpc vfat fat iTCO_wdt iTCO_vendor_support
ipmi_ssif x86_pkg_temp_thermal coretemp i2c_i801 acpi_ipmi bfq i2c_smbus ioatdma
lpc_ich ipmi_si intel_pch_thermal dca ipmi_devintf ipmi_msghandler
acpi_power_meter fuse ip_tables sd_mod ast i2c_algo_bit drm_vram_helper
drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm_ttm_helper ttm
drm i40e crct10dif_pclmul mpt3sas crc32_pclmul ahci ghash_clmulni_intel libahci
raid_class scsi_transport_sas libata pkcs8_key_parser
[   95.252173] irq event stamp: 30500990
[   95.255860] hardirqs last  enabled at (30500989): [<ffffffff81910e2d>]
_raw_spin_unlock_irqrestore+0x2d/0x40
[   95.265735] hardirqs last disabled at (30500990): [<ffffffff819050cb>]
sysvec_apic_timer_interrupt+0xb/0x90
[   95.275520] softirqs last  enabled at (30496338): [<ffffffff810b331f>]
__irq_exit_rcu+0xbf/0xe0
[   95.284259] softirqs last disabled at (30496333): [<ffffffff810b331f>]
__irq_exit_rcu+0xbf/0xe0
[   95.292994] CPU: 0 PID: 757 Comm: kworker/0:1H Not tainted 5.14.0-rc7+ #1334
[   95.300076] Hardware name: Supermicro Super Server/X11DPL-i, BIOS 3.3 02/21/2020
[   95.307504] Workqueue: kblockd blk_mq_run_work_fn
[   95.312243] RIP: 0010:_raw_spin_unlock_irqrestore+0x35/0x40
[   95.317844] Code: c7 18 53 48 89 f3 48 8b 74 24 10 e8 35 82 80 ff 48 89 ef e8
9d ac 80 ff 80 e7 02 74 06 e8 23 33 8b ff fb 65 ff 0d 8b 5f 70 7e <5b> 5d c3 0f
1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 fd 65 ff
[   95.336680] RSP: 0018:ffff888448cefbb0 EFLAGS: 00000202
[   95.341934] RAX: 0000000001d1687d RBX: 0000000000000287 RCX: 0000000000000006
[   95.349103] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff81910e2d
[   95.356270] RBP: ffff888192649218 R08: 0000000000000001 R09: 0000000000000001
[   95.363437] R10: 0000000000000000 R11: 000000000000005c R12: 0000000000000000
[   95.370604] R13: 0000000000000287 R14: ffff888192649218 R15: ffff88885fe68e80
[   95.377771] FS:  0000000000000000(0000) GS:ffff88885fe00000(0000)
knlGS:0000000000000000
[   95.385901] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   95.391675] CR2: 00007f59bfe71f80 CR3: 000000074a91e005 CR4: 00000000007706f0
[   95.398842] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   95.406009] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   95.413176] PKRU: 55555554
[   95.415904] Call Trace:
[   95.418373]  try_to_wake_up+0x268/0x7c0
[   95.422238]  blk_update_request+0x25b/0x420
[   95.426452]  blk_mq_end_request+0x1c/0x120
[   95.430576]  null_handle_cmd+0x12d/0x270 [null_blk]
[   95.435485]  blk_mq_dispatch_rq_list+0x13c/0x7f0
[   95.440130]  ? sbitmap_get+0x86/0x190
[   95.443826]  __blk_mq_do_dispatch_sched+0xb5/0x2f0
[   95.448653]  __blk_mq_sched_dispatch_requests+0xf4/0x140
[   95.453998]  blk_mq_sched_dispatch_requests+0x30/0x60
[   95.459083]  __blk_mq_run_hw_queue+0x49/0x90
[   95.463377]  process_one_work+0x26c/0x570
[   95.467421]  worker_thread+0x55/0x3c0
[   95.471103]  ? process_one_work+0x570/0x570
[   95.475313]  kthread+0x140/0x160
[   95.478567]  ? set_kthread_struct+0x40/0x40
[   95.482774]  ret_from_fork+0x1f/0x30
Bart Van Assche Aug. 27, 2021, 2:34 p.m. UTC | #15
On 8/26/21 9:49 PM, Damien Le Moal wrote:
> So the mq-deadline priority patch reduces performance by nearly half at high QD.
> With the modified patch, we are back to better numbers, but still a significant
> 20% drop at high QD.

Hi Damien,

An implementation of I/O priority for the deadline scheduler that reduces the
IOPS drop to 1% on my test setup is available here:
https://github.com/bvanassche/linux/tree/block-for-next

> (*) Note: in all cases using the mq-deadline scheduler, for the first run at
> QD=1, I get this splat 100% of the time.
> 
> [   95.173889] watchdog: BUG: soft lockup - CPU#0 stuck for 26s! [kworker/0:1H:757]
> [   95.292994] CPU: 0 PID: 757 Comm: kworker/0:1H Not tainted 5.14.0-rc7+ #1334
> [   95.307504] Workqueue: kblockd blk_mq_run_work_fn
> [   95.312243] RIP: 0010:_raw_spin_unlock_irqrestore+0x35/0x40
> [   95.415904] Call Trace:
> [   95.418373]  try_to_wake_up+0x268/0x7c0
> [   95.422238]  blk_update_request+0x25b/0x420
> [   95.426452]  blk_mq_end_request+0x1c/0x120
> [   95.430576]  null_handle_cmd+0x12d/0x270 [null_blk]
> [   95.435485]  blk_mq_dispatch_rq_list+0x13c/0x7f0
> [   95.443826]  __blk_mq_do_dispatch_sched+0xb5/0x2f0
> [   95.448653]  __blk_mq_sched_dispatch_requests+0xf4/0x140
> [   95.453998]  blk_mq_sched_dispatch_requests+0x30/0x60
> [   95.459083]  __blk_mq_run_hw_queue+0x49/0x90
> [   95.463377]  process_one_work+0x26c/0x570
> [   95.467421]  worker_thread+0x55/0x3c0
> [   95.475313]  kthread+0x140/0x160
> [   95.482774]  ret_from_fork+0x1f/0x30

I don't see any function names in the above call stack that refer to the
mq-deadline scheduler? Did I perhaps overlook something? Anyway, if you can
tell me how to reproduce this (kernel commit + kernel config) I will take a
look.

Thanks,

Bart.
Zhen Lei Aug. 28, 2021, 1:45 a.m. UTC | #16
On 2021/8/27 11:13, Jens Axboe wrote:
> On 8/26/21 8:48 PM, Bart Van Assche wrote:
>> On 8/26/21 5:05 PM, Jens Axboe wrote:
>>> On 8/26/21 6:03 PM, Bart Van Assche wrote:
>>>> Here is an overview of the tests I ran so far, all on the same test
>>>> setup:
>>>> * No I/O scheduler:               about 5630 K IOPS.
>>>> * Kernel v5.11 + mq-deadline:     about 1100 K IOPS.
>>>> * block-for-next + mq-deadline:   about  760 K IOPS.
>>>> * block-for-next with improved mq-deadline performance: about 970 K IOPS.
>>>
>>> So we're still off by about 12%, I don't think that is good enough.
>>> That's assuming that v5.11 + mq-deadline is the same as for-next with
>>> the mq-deadline change reverted? Because that would be the key number to
>>> compare it with.
>>
>> With the patch series that is available at
>> https://github.com/bvanassche/linux/tree/block-for-next the same test reports
>> 1090 K IOPS or only 1% below the v5.11 result. I will post that series on the
>> linux-block mailing list after I have finished testing that series.
> 
> OK sounds good. I do think we should just do the revert at this point,
> any real fix is going to end up being bigger than I'd like at this
> point. Then we can re-introduce the feature once we're happy with the
> results.

Yes, it's already rc7 and no longer a good time for big changes. Reverting is the
best solution, and applying my patch is a compromise solution.

>
Zhen Lei Aug. 28, 2021, 1:59 a.m. UTC | #17
On 2021/8/27 2:09, Bart Van Assche wrote:
> On 8/26/21 7:40 AM, Zhen Lei wrote:
>> lock protection needs to be added only in
>> dd_finish_request(), which is unlikely to cause significant performance
>> side effects.
> 
> Not sure the above is correct. Every new atomic instruction has a measurable
> performance overhead. But I guess in this case that overhead is smaller than
> the time needed to sum 128 per-CPU variables.
> 
>> Tested on my 128-core board with two ssd disks.
>> fio bs=4k rw=read iodepth=128 cpus_allowed=0-95 <others>
>> Before:
>> [183K/0/0 iops]
>> [172K/0/0 iops]
>>
>> After:
>> [258K/0/0 iops]
>> [258K/0/0 iops]
> 
> Nice work!
> 
>> Fixes: fb926032b320 ("block/mq-deadline: Prioritize high-priority requests")
> 
> Shouldn't the Fixes: tag be used only for patches that modify functionality?
> I'm not sure it is appropriate to use this tag for performance improvements.
> 
>>  struct deadline_data {
>> @@ -277,9 +278,9 @@ deadline_move_request(struct deadline_data *dd, struct dd_per_prio *per_prio,
>>  }
>>  
>>  /* Number of requests queued for a given priority level. */
>> -static u32 dd_queued(struct deadline_data *dd, enum dd_prio prio)
>> +static __always_inline u32 dd_queued(struct deadline_data *dd, enum dd_prio prio)
>>  {
>> -	return dd_sum(dd, inserted, prio) - dd_sum(dd, completed, prio);
>> +	return dd->per_prio[prio].nr_queued;
>>  }
> 
> Please leave out "__always_inline". Modern compilers are smart enough to
> inline this function without using the "inline" keyword.

Yes.

> 
>> @@ -711,6 +712,8 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
>>  
>>  	prio = ioprio_class_to_prio[ioprio_class];
>>  	dd_count(dd, inserted, prio);
>> +	per_prio = &dd->per_prio[prio];
>> +	per_prio->nr_queued++;
>>  
>>  	if (blk_mq_sched_try_insert_merge(q, rq, &free)) {
>>  		blk_mq_free_requests(&free);
> 
> I think the above is wrong - nr_queued should not be incremented if the
> request is merged into another request. Please move the code that increments
> nr_queued past the above if-statement.

So dd_count(dd, inserted, prio) needs to be moved past the if-statement as well?

> 
> Thanks,
> 
> Bart.
> .
>
Zhen Lei Aug. 28, 2021, 2:14 a.m. UTC | #18
On 2021/8/27 10:30, Damien Le Moal wrote:
> On Thu, 2021-08-26 at 22:40 +0800, Zhen Lei wrote:
>> dd_queued() traverses the percpu variable for summation. The more cores,
>> the higher the performance overhead. I currently have a 128-core board and
>> this function takes 2.5 us. If the number of high-priority requests is
>> small and the number of low- and medium-priority requests is large, the
>> performance impact is significant.
>>
>> Let's maintain a non-percpu member variable 'nr_queued', which is
>> incremented by 1 immediately following "inserted++" and decremented by 1
>> immediately following "completed++". Because both the judgment dd_queued()
>> in dd_dispatch_request() and operation "inserted++" in dd_insert_request()
>> are protected by dd->lock, lock protection needs to be added only in
>> dd_finish_request(), which is unlikely to cause significant performance
>> side effects.
>>
>> Tested on my 128-core board with two ssd disks.
>> fio bs=4k rw=read iodepth=128 cpus_allowed=0-95 <others>
>> Before:
>> [183K/0/0 iops]
>> [172K/0/0 iops]
>>
>> After:
>> [258K/0/0 iops]
>> [258K/0/0 iops]
>>
>> Fixes: fb926032b320 ("block/mq-deadline: Prioritize high-priority requests")
>> Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
>> ---
>>  block/mq-deadline.c | 14 +++++++++-----
>>  1 file changed, 9 insertions(+), 5 deletions(-)
>>
>> diff --git a/block/mq-deadline.c b/block/mq-deadline.c
>> index a09761cbdf12e58..d8f6aa12de80049 100644
>> --- a/block/mq-deadline.c
>> +++ b/block/mq-deadline.c
>> @@ -79,6 +79,7 @@ struct dd_per_prio {
>>  	struct list_head fifo_list[DD_DIR_COUNT];
>>  	/* Next request in FIFO order. Read, write or both are NULL. */
>>  	struct request *next_rq[DD_DIR_COUNT];
>> +	unsigned int nr_queued;
>>  };
>>  
>>  struct deadline_data {
>> @@ -277,9 +278,9 @@ deadline_move_request(struct deadline_data *dd, struct dd_per_prio *per_prio,
>>  }
>>  
>>  /* Number of requests queued for a given priority level. */
>> -static u32 dd_queued(struct deadline_data *dd, enum dd_prio prio)
>> +static __always_inline u32 dd_queued(struct deadline_data *dd, enum dd_prio prio)
>>  {
>> -	return dd_sum(dd, inserted, prio) - dd_sum(dd, completed, prio);
>> +	return dd->per_prio[prio].nr_queued;
>>  }
>>  
>>  /*
>> @@ -711,6 +712,8 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
>>  
>>  	prio = ioprio_class_to_prio[ioprio_class];
>>  	dd_count(dd, inserted, prio);
>> +	per_prio = &dd->per_prio[prio];
>> +	per_prio->nr_queued++;
>>  
>>  	if (blk_mq_sched_try_insert_merge(q, rq, &free)) {
>>  		blk_mq_free_requests(&free);
>> @@ -719,7 +722,6 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
>>  
>>  	trace_block_rq_insert(rq);
>>  
>> -	per_prio = &dd->per_prio[prio];
>>  	if (at_head) {
>>  		list_add(&rq->queuelist, &per_prio->dispatch);
>>  	} else {
>> @@ -790,12 +792,14 @@ static void dd_finish_request(struct request *rq)
>>  	const u8 ioprio_class = dd_rq_ioclass(rq);
>>  	const enum dd_prio prio = ioprio_class_to_prio[ioprio_class];
>>  	struct dd_per_prio *per_prio = &dd->per_prio[prio];
>> +	unsigned long flags;
>>  
>>  	dd_count(dd, completed, prio);
>> +	spin_lock_irqsave(&dd->lock, flags);
>> +	per_prio->nr_queued--;
>> +	spin_unlock_irqrestore(&dd->lock, flags);
> 
> dd->lock is not taken with irqsave everywhere else. This leads to hard lockups
> which I hit right away on boot. To avoid this, we need a spin_lock_irqsave()
> everywhere.

Yes, it's safer to add interrupt protection. I noticed that too. But I thought
there was a convention for upper-layer functions to turn off interrupts. So I
didn't touch it.

> 
> Of note is that without this patch, testing on nullblk with Bart's script on
> 5.14.0-rc7, I get this splat:
> 
>  [  198.726920] watchdog: BUG: soft lockup - CPU#20 stuck for 26s!
> [kworker/20:1H:260]
> [  198.734550] Modules linked in: null_blk rpcsec_gss_krb5 auth_rpcgss nfsv4
> dns_resolver nfs lockd grace fscache netfs nft_fib_inet nft_fib_ipv4
> nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject
> nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set
> nf_tables libcrc32c nfnetlink sunrpc vfat fat iTCO_wdt iTCO_vendor_support
> ipmi_ssif x86_pkg_temp_thermal acpi_ipmi coretemp ipmi_si ioatdma i2c_i801 bfq
> i2c_smbus lpc_ich intel_pch_thermal dca ipmi_devintf ipmi_msghandler
> acpi_power_meter fuse ip_tables sd_mod ast i2c_algo_bit drm_vram_helper
> drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm_ttm_helper ttm
> drm i40e crct10dif_pclmul mpt3sas crc32_pclmul ahci ghash_clmulni_intel libahci
> libata raid_class scsi_transport_sas pkcs8_key_parser
> [  198.805375] irq event stamp: 25378690
> [  198.809063] hardirqs last  enabled at (25378689): [<ffffffff81149959>]
> ktime_get+0x109/0x120
> [  198.817545] hardirqs last disabled at (25378690): [<ffffffff8190519b>]
> sysvec_apic_timer_interrupt+0xb/0x90
> [  198.827327] softirqs last  enabled at (25337302): [<ffffffff810b331f>]
> __irq_exit_rcu+0xbf/0xe0
> [  198.836066] softirqs last disabled at (25337297): [<ffffffff810b331f>]
> __irq_exit_rcu+0xbf/0xe0
> [  198.844802] CPU: 20 PID: 260 Comm: kworker/20:1H Not tainted 5.14.0-rc7+
> #1324
> [  198.852059] Hardware name: Supermicro Super Server/X11DPL-i, BIOS 3.3
> 02/21/2020
> [  198.859487] Workqueue: kblockd blk_mq_run_work_fn
> [  198.864222] RIP: 0010:__list_add_valid+0x33/0x40
> [  198.868868] Code: f2 0f 85 ec 44 44 00 4c 8b 0a 4d 39 c1 0f 85 08 45 44 00
> 48 39 d7 0f 84 e8 44 44 00 4c 39 cf 0f 84 df 44 44 00 b8 01 00 00 00 <c3> 66 66
> 2e 0f 1f 84 00 00 00 00 00 90 48 8b 17 4c 8b 47 08 48 b8
> [  198.887712] RSP: 0018:ffff8883f1337d68 EFLAGS: 00000206
> [  198.892963] RAX: 0000000000000001 RBX: ffff88857dae0840 RCX:
> 0000000000000000
> [  198.900132] RDX: ffff8885387e2bc8 RSI: ffff8885387e2bc8 RDI:
> ffff8885387e2d48
> [  198.907300] RBP: ffff8883f1337d90 R08: ffff8883f1337d90 R09:
> ffff8883f1337d90
> [  198.914467] R10: 0000000000000020 R11: 0000000000000001 R12:
> 000000000000000a
> [  198.921632] R13: ffff88857dae0800 R14: ffff8885bd3f3400 R15:
> ffff8885bd276200
> [  198.928801] FS:  0000000000000000(0000) GS:ffff888860100000(0000)
> knlGS:0000000000000000
> [  198.936929] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  198.942703] CR2: 0000000002204440 CR3: 0000000107322004 CR4:
> 00000000007706e0
> [  198.949871] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> 0000000000000000
> [  198.957036] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> 0000000000000400
> [  198.964203] PKRU: 55555554
> [  198.966933] Call Trace:
> [  198.969401]  __blk_mq_do_dispatch_sched+0x234/0x2f0
> [  198.974314]  __blk_mq_sched_dispatch_requests+0xf4/0x140
> [  198.979662]  blk_mq_sched_dispatch_requests+0x30/0x60
> [  198.984744]  __blk_mq_run_hw_queue+0x49/0x90
> [  198.989041]  process_one_work+0x26c/0x570
> [  198.993083]  worker_thread+0x55/0x3c0
> [  198.996776]  ? process_one_work+0x570/0x570
> [  199.000993]  kthread+0x140/0x160
> [  199.004243]  ? set_kthread_struct+0x40/0x40
> [  199.008452]  ret_from_fork+0x1f/0x30
> 
> 
> 
> 
>>  
>>  	if (blk_queue_is_zoned(q)) {
>> -		unsigned long flags;
>> -
>>  		spin_lock_irqsave(&dd->zone_lock, flags);
>>  		blk_req_zone_write_unlock(rq);
>>  		if (!list_empty(&per_prio->fifo_list[DD_WRITE]))
>
Bart Van Assche Aug. 28, 2021, 2:19 a.m. UTC | #19
On 8/27/21 6:45 PM, Leizhen (ThunderTown) wrote:
> On 2021/8/27 11:13, Jens Axboe wrote:
>> On 8/26/21 8:48 PM, Bart Van Assche wrote:
>>> With the patch series that is available at
>>> https://github.com/bvanassche/linux/tree/block-for-next the same test reports
>>> 1090 K IOPS or only 1% below the v5.11 result. I will post that series on the
>>> linux-block mailing list after I have finished testing that series.
>>
>> OK sounds good. I do think we should just do the revert at this point,
>> any real fix is going to end up being bigger than I'd like at this
>> point. Then we can re-introduce the feature once we're happy with the
>> results.
> 
> Yes, it's already rc7 and no longer a good time for big changes. Reverting is the
> best solution, and applying my patch is a compromise solution.

Please take a look at the patch series that is available at
https://github.com/bvanassche/linux/tree/block-for-next. Performance for
that patch series is significantly better than with your patch.

Thanks,

Bart.
Bart Van Assche Aug. 28, 2021, 2:41 a.m. UTC | #20
On 8/27/21 6:59 PM, Leizhen (ThunderTown) wrote:
>>> @@ -711,6 +712,8 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
>>>  
>>>  	prio = ioprio_class_to_prio[ioprio_class];
>>>  	dd_count(dd, inserted, prio);
>>> +	per_prio = &dd->per_prio[prio];
>>> +	per_prio->nr_queued++;
>>>  
>>>  	if (blk_mq_sched_try_insert_merge(q, rq, &free)) {
>>>  		blk_mq_free_requests(&free);
>>
>> I think the above is wrong - nr_queued should not be incremented if the
>> request is merged into another request. Please move the code that increments
>> nr_queued past the above if-statement.
> 
> So dd_count(dd, inserted, prio) needs to be moved past the if-statement as well?

dd_insert_request() is called if a request is inserted and also if it is
requeued. dd_finish_request() is called once per request. Keeping
dd_count() before blk_mq_sched_try_insert_merge() is fine since
blk_mq_free_requests() will call dd_finish_request() indirectly if a
request is merged. However, dd_count() must only happen once per request
and must not be used if a request is requeued.

Additionally, since dd_insert_request() is called with dd->lock held and
since dd_finish_request() is called directly from inside
dd_insert_request() if a request is merged, acquiring dd->lock from
inside dd_finish_request() may trigger a deadlock. A convenient way to
trigger this code path is by running test block/015 from
https://github.com/osandov/blktests/.
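Spelled out as a sketch, the recursive-locking path described above looks roughly
like this (hypothetical call chain; the function names are taken from the code
quoted earlier in the thread):

/*
 * dd_insert_requests()                  <-- takes dd->lock
 *   dd_insert_request()
 *     blk_mq_sched_try_insert_merge()   <-- rq gets merged into another request
 *       blk_mq_free_requests(&free)
 *         ...
 *           dd_finish_request()         <-- would take dd->lock a second time:
 *                                           deadlock on a non-recursive spinlock
 */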

Bart.
Zhen Lei Aug. 28, 2021, 2:42 a.m. UTC | #21
On 2021/8/28 10:19, Bart Van Assche wrote:
> On 8/27/21 6:45 PM, Leizhen (ThunderTown) wrote:
>> On 2021/8/27 11:13, Jens Axboe wrote:
>>> On 8/26/21 8:48 PM, Bart Van Assche wrote:
>>>> With the patch series that is available at
>>>> https://github.com/bvanassche/linux/tree/block-for-next the same test reports
>>>> 1090 K IOPS or only 1% below the v5.11 result. I will post that series on the
>>>> linux-block mailing list after I have finished testing that series.
>>>
>>> OK sounds good. I do think we should just do the revert at this point,
>>> any real fix is going to end up being bigger than I'd like at this
>>> point. Then we can re-introduce the feature once we're happy with the
>>> results.
>>
>> Yes, it's already rc7 and no longer a good time for big changes. Reverting is the
>> best solution, and applying my patch is a compromise solution.
> 
> Please take a look at the patch series that is available at
> https://github.com/bvanassche/linux/tree/block-for-next. Performance for
> that patch series is significantly better than with your patch.

Yes, this patch is better than mine. However, Jens prefers not to risk
functional stability in v5.14. v5.15 needs neither my patch nor the revert.

I'll test your patch this afternoon. I don't have the environment yet.

> 
> Thanks,
> 
> Bart.
> .
>
Zhen Lei Aug. 28, 2021, 1:14 p.m. UTC | #22
On 2021/8/28 10:42, Leizhen (ThunderTown) wrote:
> 
> 
> On 2021/8/28 10:19, Bart Van Assche wrote:
>> On 8/27/21 6:45 PM, Leizhen (ThunderTown) wrote:
>>> On 2021/8/27 11:13, Jens Axboe wrote:
>>>> On 8/26/21 8:48 PM, Bart Van Assche wrote:
>>>>> With the patch series that is available at
>>>>> https://github.com/bvanassche/linux/tree/block-for-next the same test reports
>>>>> 1090 K IOPS or only 1% below the v5.11 result. I will post that series on the
>>>>> linux-block mailing list after I have finished testing that series.
>>>>
>>>> OK sounds good. I do think we should just do the revert at this point,
>>>> any real fix is going to end up being bigger than I'd like at this
>>>> point. Then we can re-introduce the feature once we're happy with the
>>>> results.
>>>
>>> Yes, it's already rc7 and no longer a good time for big changes. Reverting is the
>>> best solution, and applying my patch is a compromise solution.
>>
>> Please take a look at the patch series that is available at
>> https://github.com/bvanassche/linux/tree/block-for-next. Performance for
>> that patch series is significantly better than with your patch.
> 
> Yes, this patch is better than mine. However, Jens prefers not to risk
> functional stability in v5.14. v5.15 needs neither my patch nor the revert.
> 
> I'll test your patch this afternoon. I don't have the environment yet.


Revert:
253K/0/0
258K/0/0

With your patch:
258K/0/0
252K/0/0

With my patch:
245K/0/0
258K/0/0
244K/0/0

I see that Jens has already pushed the revert into v5.14-rc8.


> 
>>
>> Thanks,
>>
>> Bart.
>> .
>>
Damien Le Moal Aug. 29, 2021, 11:02 p.m. UTC | #23
On 2021/08/27 23:34, Bart Van Assche wrote:
> On 8/26/21 9:49 PM, Damien Le Moal wrote:
>> So the mq-deadline priority patch reduces performance by nearly half at high QD.
>> With the modified patch, we are back to better numbers, but still a significant
>> 20% drop at high QD.
> 
> Hi Damien,
> 
> An implementation of I/O priority for the deadline scheduler that reduces the
> IOPS drop to 1% on my test setup is available here:
> https://github.com/bvanassche/linux/tree/block-for-next
> 
>> (*) Note: in all cases using the mq-deadline scheduler, for the first run at
>> QD=1, I get this splat 100% of the time.
>>
>> [   95.173889] watchdog: BUG: soft lockup - CPU#0 stuck for 26s! [kworker/0:1H:757]
>> [   95.292994] CPU: 0 PID: 757 Comm: kworker/0:1H Not tainted 5.14.0-rc7+ #1334
>> [   95.307504] Workqueue: kblockd blk_mq_run_work_fn
>> [   95.312243] RIP: 0010:_raw_spin_unlock_irqrestore+0x35/0x40
>> [   95.415904] Call Trace:
>> [   95.418373]  try_to_wake_up+0x268/0x7c0
>> [   95.422238]  blk_update_request+0x25b/0x420
>> [   95.426452]  blk_mq_end_request+0x1c/0x120
>> [   95.430576]  null_handle_cmd+0x12d/0x270 [null_blk]
>> [   95.435485]  blk_mq_dispatch_rq_list+0x13c/0x7f0
>> [   95.443826]  __blk_mq_do_dispatch_sched+0xb5/0x2f0
>> [   95.448653]  __blk_mq_sched_dispatch_requests+0xf4/0x140
>> [   95.453998]  blk_mq_sched_dispatch_requests+0x30/0x60
>> [   95.459083]  __blk_mq_run_hw_queue+0x49/0x90
>> [   95.463377]  process_one_work+0x26c/0x570
>> [   95.467421]  worker_thread+0x55/0x3c0
>> [   95.475313]  kthread+0x140/0x160
>> [   95.482774]  ret_from_fork+0x1f/0x30
> 
> I don't see any function names in the above call stack that refer to the
> mq-deadline scheduler? Did I perhaps overlook something? Anyway, if you can
> tell me how to reproduce this (kernel commit + kernel config) I will take a
> look.

Indeed, the stack trace does not show any mq-deadline function. But the
workqueue is stuck on _raw_spin_unlock_irqrestore() in the blk_mq_run_work_fn()
function. I suspect that the spinlock is dd->lock, so the CPU may be stuck on
entry to mq-deadline dispatch or finish request methods. Not entirely sure.

I got this splat with 5.4.0-rc7 (Linus tag patch) with the attached config.
Keith Busch Aug. 30, 2021, 2:31 a.m. UTC | #24
On Sun, Aug 29, 2021 at 11:02:22PM +0000, Damien Le Moal wrote:
> On 2021/08/27 23:34, Bart Van Assche wrote:
> > On 8/26/21 9:49 PM, Damien Le Moal wrote:
> >> So the mq-deadline priority patch reduces performance by nearly half at high QD.
> >> With the modified patch, we are back to better numbers, but still a significant
> >> 20% drop at high QD.
> > 
> > Hi Damien,
> > 
> > An implementation of I/O priority for the deadline scheduler that reduces the
> > IOPS drop to 1% on my test setup is available here:
> > https://github.com/bvanassche/linux/tree/block-for-next
> > 
> >> (*) Note: in all cases using the mq-deadline scheduler, for the first run at
> >> QD=1, I get this splat 100% of the time.
> >>
> >> [   95.173889] watchdog: BUG: soft lockup - CPU#0 stuck for 26s! [kworker/0:1H:757]
> >> [   95.292994] CPU: 0 PID: 757 Comm: kworker/0:1H Not tainted 5.14.0-rc7+ #1334
> >> [   95.307504] Workqueue: kblockd blk_mq_run_work_fn
> >> [   95.312243] RIP: 0010:_raw_spin_unlock_irqrestore+0x35/0x40
> >> [   95.415904] Call Trace:
> >> [   95.418373]  try_to_wake_up+0x268/0x7c0
> >> [   95.422238]  blk_update_request+0x25b/0x420
> >> [   95.426452]  blk_mq_end_request+0x1c/0x120
> >> [   95.430576]  null_handle_cmd+0x12d/0x270 [null_blk]
> >> [   95.435485]  blk_mq_dispatch_rq_list+0x13c/0x7f0
> >> [   95.443826]  __blk_mq_do_dispatch_sched+0xb5/0x2f0
> >> [   95.448653]  __blk_mq_sched_dispatch_requests+0xf4/0x140
> >> [   95.453998]  blk_mq_sched_dispatch_requests+0x30/0x60
> >> [   95.459083]  __blk_mq_run_hw_queue+0x49/0x90
> >> [   95.463377]  process_one_work+0x26c/0x570
> >> [   95.467421]  worker_thread+0x55/0x3c0
> >> [   95.475313]  kthread+0x140/0x160
> >> [   95.482774]  ret_from_fork+0x1f/0x30
> > 
> > I don't see any function names in the above call stack that refer to the
> > mq-deadline scheduler? Did I perhaps overlook something? Anyway, if you can
> > tell me how to reproduce this (kernel commit + kernel config) I will take a
> > look.
> 
> Indeed, the stack trace does not show any mq-deadline function. But the
> workqueue is stuck on _raw_spin_unlock_irqrestore() in the blk_mq_run_work_fn()
> function. I suspect that the spinlock is dd->lock, so the CPU may be stuck on
> entry to mq-deadline dispatch or finish request methods. Not entirely sure.

I don't think you can be stuck on the *unlock* part, though. In my
experience, that function showing up in a soft lockup indicates you're
in a broken loop that's repeatedly locking and unlocking. I haven't
found anything immediately obvious in this call chain, though.

> I got this splat with 5.4.0-rc7 (Linus tag patch) with the attached config.

Surely 5.14.0-rc7, right?
Bart Van Assche Aug. 30, 2021, 2:40 a.m. UTC | #25
On 8/29/21 16:02, Damien Le Moal wrote:
> On 2021/08/27 23:34, Bart Van Assche wrote:
>> On 8/26/21 9:49 PM, Damien Le Moal wrote:
>>> So the mq-deadline priority patch reduces performance by nearly half at high QD.
>>> (*) Note: in all cases using the mq-deadline scheduler, for the first run at
>>> QD=1, I get this splat 100% of the time.
>>>
>>> [   95.173889] watchdog: BUG: soft lockup - CPU#0 stuck for 26s! [kworker/0:1H:757]
>>> [   95.292994] CPU: 0 PID: 757 Comm: kworker/0:1H Not tainted 5.14.0-rc7+ #1334
>>> [   95.307504] Workqueue: kblockd blk_mq_run_work_fn
>>> [   95.312243] RIP: 0010:_raw_spin_unlock_irqrestore+0x35/0x40
>>> [   95.415904] Call Trace:
>>> [   95.418373]  try_to_wake_up+0x268/0x7c0
>>> [   95.422238]  blk_update_request+0x25b/0x420
>>> [   95.426452]  blk_mq_end_request+0x1c/0x120
>>> [   95.430576]  null_handle_cmd+0x12d/0x270 [null_blk]
>>> [   95.435485]  blk_mq_dispatch_rq_list+0x13c/0x7f0
>>> [   95.443826]  __blk_mq_do_dispatch_sched+0xb5/0x2f0
>>> [   95.448653]  __blk_mq_sched_dispatch_requests+0xf4/0x140
>>> [   95.453998]  blk_mq_sched_dispatch_requests+0x30/0x60
>>> [   95.459083]  __blk_mq_run_hw_queue+0x49/0x90
>>> [   95.463377]  process_one_work+0x26c/0x570
>>> [   95.467421]  worker_thread+0x55/0x3c0
>>> [   95.475313]  kthread+0x140/0x160
>>> [   95.482774]  ret_from_fork+0x1f/0x30
>>
>> I don't see any function names in the above call stack that refer to the
>> mq-deadline scheduler? Did I perhaps overlook something? Anyway, if you can
>> tell me how to reproduce this (kernel commit + kernel config) I will take a
>> look.
> 
> Indeed, the stack trace does not show any mq-deadline function. But the
> workqueue is stuck on _raw_spin_unlock_irqrestore() in the blk_mq_run_work_fn()
> function. I suspect that the spinlock is dd->lock, so the CPU may be stuck on
> entry to mq-deadline dispatch or finish request methods. Not entirely sure.
> 
> I got this splat with 5.4.0-rc7 (Linus tag patch) with the attached config.

Hi Damien,

Thank you for having shared the kernel configuration used in your test. 
So far I have not yet been able to reproduce the above call trace in a 
VM. Could the above call trace have been triggered by the mpt3sas driver 
instead of the mq-deadline I/O scheduler?

Thanks,

Bart.
Damien Le Moal Aug. 30, 2021, 3:03 a.m. UTC | #26
On 2021/08/30 11:32, Keith Busch wrote:
> On Sun, Aug 29, 2021 at 11:02:22PM +0000, Damien Le Moal wrote:
>> On 2021/08/27 23:34, Bart Van Assche wrote:
>>> On 8/26/21 9:49 PM, Damien Le Moal wrote:
>>>> So the mq-deadline priority patch reduces performance by nearly half at high QD.
>>>> With the modified patch, we are back to better numbers, but still a significant
>>>> 20% drop at high QD.
>>>
>>> Hi Damien,
>>>
>>> An implementation of I/O priority for the deadline scheduler that reduces the
>>> IOPS drop to 1% on my test setup is available here:
>>> https://github.com/bvanassche/linux/tree/block-for-next
>>>
>>>> (*) Note: in all cases using the mq-deadline scheduler, for the first run at
>>>> QD=1, I get this splat 100% of the time.
>>>>
>>>> [   95.173889] watchdog: BUG: soft lockup - CPU#0 stuck for 26s! [kworker/0:1H:757]
>>>> [   95.292994] CPU: 0 PID: 757 Comm: kworker/0:1H Not tainted 5.14.0-rc7+ #1334
>>>> [   95.307504] Workqueue: kblockd blk_mq_run_work_fn
>>>> [   95.312243] RIP: 0010:_raw_spin_unlock_irqrestore+0x35/0x40
>>>> [   95.415904] Call Trace:
>>>> [   95.418373]  try_to_wake_up+0x268/0x7c0
>>>> [   95.422238]  blk_update_request+0x25b/0x420
>>>> [   95.426452]  blk_mq_end_request+0x1c/0x120
>>>> [   95.430576]  null_handle_cmd+0x12d/0x270 [null_blk]
>>>> [   95.435485]  blk_mq_dispatch_rq_list+0x13c/0x7f0
>>>> [   95.443826]  __blk_mq_do_dispatch_sched+0xb5/0x2f0
>>>> [   95.448653]  __blk_mq_sched_dispatch_requests+0xf4/0x140
>>>> [   95.453998]  blk_mq_sched_dispatch_requests+0x30/0x60
>>>> [   95.459083]  __blk_mq_run_hw_queue+0x49/0x90
>>>> [   95.463377]  process_one_work+0x26c/0x570
>>>> [   95.467421]  worker_thread+0x55/0x3c0
>>>> [   95.475313]  kthread+0x140/0x160
>>>> [   95.482774]  ret_from_fork+0x1f/0x30
>>>
>>> I don't see any function names in the above call stack that refer to the
>>> mq-deadline scheduler? Did I perhaps overlook something? Anyway, if you can
>>> tell me how to reproduce this (kernel commit + kernel config) I will take a
>>> look.
>>
>> Indeed, the stack trace does not show any mq-deadline function. But the
>> workqueue is stuck on _raw_spin_unlock_irqrestore() in the blk_mq_run_work_fn()
>> function. I suspect that the spinlock is dd->lock, so the CPU may be stuck on
>> entry to mq-deadline dispatch or finish request methods. Not entirely sure.
> 
> I don't think you can be stuck on the *unlock* part, though. In my
> experience, that function showing up in a soft lockup indicates you're
> in a broken loop that's repeatedly locking and unlocking. I haven't
> found anything immediately obvious in this call chain, though.

Arg. I misread the stack trace. It is an unlock, not a lock...
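
To make the point above concrete, here is an invented example of the pattern Keith describes; "example_ctx" and "example_work" are made-up names, not anything in the block layer or in this patch. With spin_lock_irqsave()/spin_unlock_irqrestore(), interrupts are only re-enabled at the unlock, so that is where the soft-lockup watchdog usually catches the instruction pointer, even though the real problem is the loop that never makes progress.

#include <linux/list.h>
#include <linux/spinlock.h>

struct example_ctx {
	spinlock_t lock;
	struct list_head pending;	/* work items nobody ever removes */
};

static void example_work(struct example_ctx *ctx)
{
	unsigned long flags;

	while (!list_empty(&ctx->pending)) {
		spin_lock_irqsave(&ctx->lock, flags);
		/* handles nothing, so the list never shrinks */
		spin_unlock_irqrestore(&ctx->lock, flags);
		/*
		 * No forward progress and no cond_resched(): after ~26s the
		 * watchdog fires and reports the unlock above as the RIP.
		 */
	}
}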

> 
>> I got this splat with 5.4.0-rc7 (Linus tag patch) with the attached config.
> 
> Surely 5.14.0-rc7, right?

Oops. Yes.

>
Damien Le Moal Aug. 30, 2021, 3:07 a.m. UTC | #27
On 2021/08/30 11:40, Bart Van Assche wrote:
> On 8/29/21 16:02, Damien Le Moal wrote:
>> On 2021/08/27 23:34, Bart Van Assche wrote:
>>> On 8/26/21 9:49 PM, Damien Le Moal wrote:
>>>> So the mq-deadline priority patch reduces performance by nearly half at high QD.
>>>> (*) Note: in all cases using the mq-deadline scheduler, for the first run at
>>>> QD=1, I get this splat 100% of the time.
>>>>
>>>> [   95.173889] watchdog: BUG: soft lockup - CPU#0 stuck for 26s! [kworker/0:1H:757]
>>>> [   95.292994] CPU: 0 PID: 757 Comm: kworker/0:1H Not tainted 5.14.0-rc7+ #1334
>>>> [   95.307504] Workqueue: kblockd blk_mq_run_work_fn
>>>> [   95.312243] RIP: 0010:_raw_spin_unlock_irqrestore+0x35/0x40
>>>> [   95.415904] Call Trace:
>>>> [   95.418373]  try_to_wake_up+0x268/0x7c0
>>>> [   95.422238]  blk_update_request+0x25b/0x420
>>>> [   95.426452]  blk_mq_end_request+0x1c/0x120
>>>> [   95.430576]  null_handle_cmd+0x12d/0x270 [null_blk]
>>>> [   95.435485]  blk_mq_dispatch_rq_list+0x13c/0x7f0
>>>> [   95.443826]  __blk_mq_do_dispatch_sched+0xb5/0x2f0
>>>> [   95.448653]  __blk_mq_sched_dispatch_requests+0xf4/0x140
>>>> [   95.453998]  blk_mq_sched_dispatch_requests+0x30/0x60
>>>> [   95.459083]  __blk_mq_run_hw_queue+0x49/0x90
>>>> [   95.463377]  process_one_work+0x26c/0x570
>>>> [   95.467421]  worker_thread+0x55/0x3c0
>>>> [   95.475313]  kthread+0x140/0x160
>>>> [   95.482774]  ret_from_fork+0x1f/0x30
>>>
>>> I don't see any function names in the above call stack that refer to the
>>> mq-deadline scheduler? Did I perhaps overlook something? Anyway, if you can
>>> tell me how to reproduce this (kernel commit + kernel config) I will take a
>>> look.
>>
>> Indeed, the stack trace does not show any mq-deadline function. But the
>> workqueue is stuck on _raw_spin_unlock_irqrestore() in the blk_mq_run_work_fn()
>> function. I suspect that the spinlock is dd->lock, so the CPU may be stuck on
>> entry to mq-deadline dispatch or finish request methods. Not entirely sure.
>>
>> I got this splat with 5.4.0-rc7 (Linus tag patch) with the attached config.
> 
> Hi Damien,
> 
> Thank you for having shared the kernel configuration used in your test. 
> So far I have not yet been able to reproduce the above call trace in a 
> VM. Could the above call trace have been triggered by the mpt3sas driver 
> instead of the mq-deadline I/O scheduler?

The above was triggered using nullblk with the test script you sent. I was not
using drives on the HBA or AHCI when it happened. And I can reproduce this 100%
of the time by running your script with QD=1.


> 
> Thanks,
> 
> Bart.
> 
>
Bart Van Assche Aug. 30, 2021, 5:14 p.m. UTC | #28
On 8/29/21 8:07 PM, Damien Le Moal wrote:
> On 2021/08/30 11:40, Bart Van Assche wrote:
>> On 8/29/21 16:02, Damien Le Moal wrote:
>>> On 2021/08/27 23:34, Bart Van Assche wrote:
>>>> On 8/26/21 9:49 PM, Damien Le Moal wrote:
>>>>> So the mq-deadline priority patch reduces performance by nearly half at high QD.
>>>>> (*) Note: in all cases using the mq-deadline scheduler, for the first run at
>>>>> QD=1, I get this splat 100% of the time.
>>>>>
>>>>> [   95.173889] watchdog: BUG: soft lockup - CPU#0 stuck for 26s! [kworker/0:1H:757]
>>>>> [   95.292994] CPU: 0 PID: 757 Comm: kworker/0:1H Not tainted 5.14.0-rc7+ #1334
>>>>> [   95.307504] Workqueue: kblockd blk_mq_run_work_fn
>>>>> [   95.312243] RIP: 0010:_raw_spin_unlock_irqrestore+0x35/0x40
>>>>> [   95.415904] Call Trace:
>>>>> [   95.418373]  try_to_wake_up+0x268/0x7c0
>>>>> [   95.422238]  blk_update_request+0x25b/0x420
>>>>> [   95.426452]  blk_mq_end_request+0x1c/0x120
>>>>> [   95.430576]  null_handle_cmd+0x12d/0x270 [null_blk]
>>>>> [   95.435485]  blk_mq_dispatch_rq_list+0x13c/0x7f0
>>>>> [   95.443826]  __blk_mq_do_dispatch_sched+0xb5/0x2f0
>>>>> [   95.448653]  __blk_mq_sched_dispatch_requests+0xf4/0x140
>>>>> [   95.453998]  blk_mq_sched_dispatch_requests+0x30/0x60
>>>>> [   95.459083]  __blk_mq_run_hw_queue+0x49/0x90
>>>>> [   95.463377]  process_one_work+0x26c/0x570
>>>>> [   95.467421]  worker_thread+0x55/0x3c0
>>>>> [   95.475313]  kthread+0x140/0x160
>>>>> [   95.482774]  ret_from_fork+0x1f/0x30
>>>>
>>>> I don't see any function names in the above call stack that refer to the
>>>> mq-deadline scheduler? Did I perhaps overlook something? Anyway, if you can
>>>> tell me how to reproduce this (kernel commit + kernel config) I will take a
>>>> look.
>>>
>>> Indeed, the stack trace does not show any mq-deadline function. But the
>>> workqueue is stuck on _raw_spin_unlock_irqrestore() in the blk_mq_run_work_fn()
>>> function. I suspect that the spinlock is dd->lock, so the CPU may be stuck on
>>> entry to mq-deadline dispatch or finish request methods. Not entirely sure.
>>>
>>> I got this splat with 5.4.0-rc7 (Linus tag patch) with the attached config.
>>
>> Hi Damien,
>>
>> Thank you for having shared the kernel configuration used in your test.
>> So far I have not yet been able to reproduce the above call trace in a
>> VM. Could the above call trace have been triggered by the mpt3sas driver
>> instead of the mq-deadline I/O scheduler?
> 
> The above was triggered using nullblk with the test script you sent. I was not
> using drives on the HBA or AHCI when it happened. And I can reproduce this 100%
> of the time by running your script with QD=1.

Hi Damien,

I rebuilt kernel v5.14-rc7 (tag v5.14-rc7) after having run git clean -f -d -x
and reran my nullb iops test with the mq-deadline scheduler. No kernel complaints
appeared in the kernel log. Next I enabled lockdep (CONFIG_PROVE_LOCKING=y) and
reran the nullb iops test with mq-deadline as scheduler. Again zero complaints
appeared in the kernel log. Next I ran a subset of the blktests tests
(./check -q block). All tests passed and no complaints appeared in the kernel log.

Please help with root-causing this issue by rerunning the test on your setup after
having enabled lockdep.

Thanks,

Bart.
Damien Le Moal Aug. 30, 2021, 9:42 p.m. UTC | #29
On 2021/08/31 2:14, Bart Van Assche wrote:
> On 8/29/21 8:07 PM, Damien Le Moal wrote:
>> On 2021/08/30 11:40, Bart Van Assche wrote:
>>> On 8/29/21 16:02, Damien Le Moal wrote:
>>>> On 2021/08/27 23:34, Bart Van Assche wrote:
>>>>> On 8/26/21 9:49 PM, Damien Le Moal wrote:
>>>>>> So the mq-deadline priority patch reduces performance by nearly half at high QD.
>>>>>> (*) Note: in all cases using the mq-deadline scheduler, for the first run at
>>>>>> QD=1, I get this splat 100% of the time.
>>>>>>
>>>>>> [   95.173889] watchdog: BUG: soft lockup - CPU#0 stuck for 26s! [kworker/0:1H:757]
>>>>>> [   95.292994] CPU: 0 PID: 757 Comm: kworker/0:1H Not tainted 5.14.0-rc7+ #1334
>>>>>> [   95.307504] Workqueue: kblockd blk_mq_run_work_fn
>>>>>> [   95.312243] RIP: 0010:_raw_spin_unlock_irqrestore+0x35/0x40
>>>>>> [   95.415904] Call Trace:
>>>>>> [   95.418373]  try_to_wake_up+0x268/0x7c0
>>>>>> [   95.422238]  blk_update_request+0x25b/0x420
>>>>>> [   95.426452]  blk_mq_end_request+0x1c/0x120
>>>>>> [   95.430576]  null_handle_cmd+0x12d/0x270 [null_blk]
>>>>>> [   95.435485]  blk_mq_dispatch_rq_list+0x13c/0x7f0
>>>>>> [   95.443826]  __blk_mq_do_dispatch_sched+0xb5/0x2f0
>>>>>> [   95.448653]  __blk_mq_sched_dispatch_requests+0xf4/0x140
>>>>>> [   95.453998]  blk_mq_sched_dispatch_requests+0x30/0x60
>>>>>> [   95.459083]  __blk_mq_run_hw_queue+0x49/0x90
>>>>>> [   95.463377]  process_one_work+0x26c/0x570
>>>>>> [   95.467421]  worker_thread+0x55/0x3c0
>>>>>> [   95.475313]  kthread+0x140/0x160
>>>>>> [   95.482774]  ret_from_fork+0x1f/0x30
>>>>>
>>>>> I don't see any function names in the above call stack that refer to the
>>>>> mq-deadline scheduler? Did I perhaps overlook something? Anyway, if you can
>>>>> tell me how to reproduce this (kernel commit + kernel config) I will take a
>>>>> look.
>>>>
>>>> Indeed, the stack trace does not show any mq-deadline function. But the
>>>> workqueue is stuck on _raw_spin_unlock_irqrestore() in the blk_mq_run_work_fn()
>>>> function. I suspect that the spinlock is dd->lock, so the CPU may be stuck on
>>>> entry to mq-deadline dispatch or finish request methods. Not entirely sure.
>>>>
>>>> I got this splat with 5.4.0-rc7 (Linus tag patch) with the attached config.
>>>
>>> Hi Damien,
>>>
>>> Thank you for having shared the kernel configuration used in your test.
>>> So far I have not yet been able to reproduce the above call trace in a
>>> VM. Could the above call trace have been triggered by the mpt3sas driver
>>> instead of the mq-deadline I/O scheduler?
>>
>> The above was triggered using nullblk with the test script you sent. I was not
>> using drives on the HBA or AHCI when it happens. And I can reproduce this 100%
>> of the time by running your script with QD=1.
> 
> Hi Damien,
> 
> I rebuilt kernel v5.14-rc7 (tag v5.14-rc7) after having run git clean -f -d -x
> and reran my nullb iops test with the mq-deadline scheduler. No kernel complaints
> appeared in the kernel log. Next I enabled lockdep (CONFIG_PROVE_LOCKING=y) and
> reran the nullb iops test with mq-deadline as scheduler. Again zero complaints
> appeared in the kernel log. Next I ran a subset of the blktests test
> (./check -q block). All tests passed and no complaints appeared in the kernel log.
> 
> Please help with root-causing this issue by rerunning the test on your setup after
> having enabled lockdep.

OK. Will have a look again.

> 
> Thanks,
> 
> Bart.
>

Patch

diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index a09761cbdf12e58..d8f6aa12de80049 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -79,6 +79,7 @@  struct dd_per_prio {
 	struct list_head fifo_list[DD_DIR_COUNT];
 	/* Next request in FIFO order. Read, write or both are NULL. */
 	struct request *next_rq[DD_DIR_COUNT];
+	unsigned int nr_queued;
 };
 
 struct deadline_data {
@@ -277,9 +278,9 @@  deadline_move_request(struct deadline_data *dd, struct dd_per_prio *per_prio,
 }
 
 /* Number of requests queued for a given priority level. */
-static u32 dd_queued(struct deadline_data *dd, enum dd_prio prio)
+static __always_inline u32 dd_queued(struct deadline_data *dd, enum dd_prio prio)
 {
-	return dd_sum(dd, inserted, prio) - dd_sum(dd, completed, prio);
+	return dd->per_prio[prio].nr_queued;
 }
 
 /*
@@ -711,6 +712,8 @@  static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
 
 	prio = ioprio_class_to_prio[ioprio_class];
 	dd_count(dd, inserted, prio);
+	per_prio = &dd->per_prio[prio];
+	per_prio->nr_queued++;
 
 	if (blk_mq_sched_try_insert_merge(q, rq, &free)) {
 		blk_mq_free_requests(&free);
@@ -719,7 +722,6 @@  static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
 
 	trace_block_rq_insert(rq);
 
-	per_prio = &dd->per_prio[prio];
 	if (at_head) {
 		list_add(&rq->queuelist, &per_prio->dispatch);
 	} else {
@@ -790,12 +792,14 @@  static void dd_finish_request(struct request *rq)
 	const u8 ioprio_class = dd_rq_ioclass(rq);
 	const enum dd_prio prio = ioprio_class_to_prio[ioprio_class];
 	struct dd_per_prio *per_prio = &dd->per_prio[prio];
+	unsigned long flags;
 
 	dd_count(dd, completed, prio);
+	spin_lock_irqsave(&dd->lock, flags);
+	per_prio->nr_queued--;
+	spin_unlock_irqrestore(&dd->lock, flags);
 
 	if (blk_queue_is_zoned(q)) {
-		unsigned long flags;
-
 		spin_lock_irqsave(&dd->zone_lock, flags);
 		blk_req_zone_write_unlock(rq);
 		if (!list_empty(&per_prio->fifo_list[DD_WRITE]))