block: limit request dispatch loop duration

Message ID	20220316061134.3812309-1-shinichiro.kawasaki@wdc.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <linux-block-owner@kernel.org> IronPort-SDR: dqGq2nAd9J/3dXvZYMXwZbSTTOOvXw4oAyYJZDHPiqeIn5YXR2hv1ufQOLEkeW14wBhkwLik0g GM8ZZSak0dO3SR7zaLyWc/NhuFFDlf4OK/1L/l/u2nT0mCJfJ5o7SS7H7J0T4g8UbyPeWgUr2J jSIWnehqNyuYvoDsaUHOA33MUYfGkKuwwpzNP6LTMyFnKKxWfUwVquSRMqcQ0YIno+ImySCIzG 9NGjxsoXXBPiBhiEc8ieZoeOehrkzoLEHKL5viSyP0vMQQVdrK8Uef113YYa7kcyEDZNx+Q0r9 /TES3AqFCUxpgWl8u87wezpI IronPort-SDR: m1sMeA/7Eb9PCIqORL9dG2/DZlqmFURu2U8eUiGs9Ck9cY6zaOKF4xV6NJ5fgt39/zefblyUlg CxshfIeInLj9wbfFVlJAtGDnk3gVet4sfJ76cNfYz3i7XtagrlJ3zuJsrvaKbvsOpSzDwC97CF Gce+CIQXMG017Fh5K64fZwnS/PwZ4hLPgWHrpRqb6Ebw7jmjgXsCtL62D64W9x3mwD3XfwEj+l QyIqKrLVUD9FpoRCRkDr91UdIthTmWO5QpRCyMmYGdWLYsXuReU3EOIsy2Jr3rUsdsTgvMGSiQ RXY= WDCIronportException: Internal From: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> To: linux-block@vger.kernel.org, Jens Axboe <axboe@kernel.dk>, Ming Lei <ming.lei@redhat.com> Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com>, Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Subject: [PATCH] block: limit request dispatch loop duration Date: Wed, 16 Mar 2022 15:11:34 +0900 Message-Id: <20220316061134.3812309-1-shinichiro.kawasaki@wdc.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	block: limit request dispatch loop duration \| expand block: limit request dispatch loop duration

Message ID

20220316061134.3812309-1-shinichiro.kawasaki@wdc.com (mailing list archive)

State

New, archived

Headers

IronPort-SDR: 
 dqGq2nAd9J/3dXvZYMXwZbSTTOOvXw4oAyYJZDHPiqeIn5YXR2hv1ufQOLEkeW14wBhkwLik0g
 GM8ZZSak0dO3SR7zaLyWc/NhuFFDlf4OK/1L/l/u2nT0mCJfJ5o7SS7H7J0T4g8UbyPeWgUr2J
 jSIWnehqNyuYvoDsaUHOA33MUYfGkKuwwpzNP6LTMyFnKKxWfUwVquSRMqcQ0YIno+ImySCIzG
 9NGjxsoXXBPiBhiEc8ieZoeOehrkzoLEHKL5viSyP0vMQQVdrK8Uef113YYa7kcyEDZNx+Q0r9
 /TES3AqFCUxpgWl8u87wezpI
IronPort-SDR: 
 m1sMeA/7Eb9PCIqORL9dG2/DZlqmFURu2U8eUiGs9Ck9cY6zaOKF4xV6NJ5fgt39/zefblyUlg
 CxshfIeInLj9wbfFVlJAtGDnk3gVet4sfJ76cNfYz3i7XtagrlJ3zuJsrvaKbvsOpSzDwC97CF
 Gce+CIQXMG017Fh5K64fZwnS/PwZ4hLPgWHrpRqb6Ebw7jmjgXsCtL62D64W9x3mwD3XfwEj+l
 QyIqKrLVUD9FpoRCRkDr91UdIthTmWO5QpRCyMmYGdWLYsXuReU3EOIsy2Jr3rUsdsTgvMGSiQ
 RXY=
WDCIronportException: Internal
From: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
To: linux-block@vger.kernel.org, Jens Axboe <axboe@kernel.dk>,
        Ming Lei <ming.lei@redhat.com>
Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com>,
        Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Subject: [PATCH] block: limit request dispatch loop duration
Date: Wed, 16 Mar 2022 15:11:34 +0900
Message-Id: <20220316061134.3812309-1-shinichiro.kawasaki@wdc.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Precedence: bulk

Series

block: limit request dispatch loop duration | expand

Commit Message

Shin'ichiro Kawasaki March 16, 2022, 6:11 a.m. UTC

When IO requests are made continuously and the target block device
handles requests faster than request arrival, the request dispatch loop
keeps on repeating to dispatch the arriving requests very long time,
more than a minute. Since the loop runs as a workqueue worker task, the
very long loop duration triggers workqueue watchdog timeout and BUG [1].

To avoid the very long loop duration, break the loop periodically. When
opportunity to dispatch requests still exists, check need_resched(). If
need_resched() returns true, the dispatch loop already consumed its time
slice, then reschedule the dispatch work and break the loop. With heavy
IO load, need_resched() does not return true for 20~30 seconds. To cover
such case, check time spent in the dispatch loop with jiffies. If more
than 1 second is spent, reschedule the dispatch work and break the loop.

[1]

[  609.691437] BUG: workqueue lockup - pool cpus=10 node=1 flags=0x0 nice=-20 stuck for 35s!
[  609.701820] Showing busy workqueues and worker pools:
[  609.707915] workqueue events: flags=0x0
[  609.712615]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[  609.712626]     pending: drm_fb_helper_damage_work [drm_kms_helper]
[  609.712687] workqueue events_freezable: flags=0x4
[  609.732943]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[  609.732952]     pending: pci_pme_list_scan
[  609.732968] workqueue events_power_efficient: flags=0x80
[  609.751947]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[  609.751955]     pending: neigh_managed_work
[  609.752018] workqueue kblockd: flags=0x18
[  609.769480]   pwq 21: cpus=10 node=1 flags=0x0 nice=-20 active=3/256 refcnt=4
[  609.769488]     in-flight: 1020:blk_mq_run_work_fn
[  609.769498]     pending: blk_mq_timeout_work, blk_mq_run_work_fn
[  609.769744] pool 21: cpus=10 node=1 flags=0x0 nice=-20 hung=35s workers=2 idle: 67
[  639.899730] BUG: workqueue lockup - pool cpus=10 node=1 flags=0x0 nice=-20 stuck for 66s!
[  639.909513] Showing busy workqueues and worker pools:
[  639.915404] workqueue events: flags=0x0
[  639.920197]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[  639.920215]     pending: drm_fb_helper_damage_work [drm_kms_helper]
[  639.920365] workqueue kblockd: flags=0x18
[  639.939932]   pwq 21: cpus=10 node=1 flags=0x0 nice=-20 active=3/256 refcnt=4
[  639.939942]     in-flight: 1020:blk_mq_run_work_fn
[  639.939955]     pending: blk_mq_timeout_work, blk_mq_run_work_fn
[  639.940212] pool 21: cpus=10 node=1 flags=0x0 nice=-20 hung=66s workers=2 idle: 67

Fixes: 6e6fcbc27e778 ("blk-mq: support batching dispatch in case of io")
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Cc: stable@vger.kernel.org # v5.10+
Link: https://lore.kernel.org/linux-block/20220310091649.zypaem5lkyfadymg@shindev/
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq-sched.c | 6 ++++++
 1 file changed, 6 insertions(+)

Comments

Damien Le Moal March 16, 2022, 6:55 a.m. UTC | #1

On 3/16/22 15:11, Shin'ichiro Kawasaki wrote:
> When IO requests are made continuously and the target block device
> handles requests faster than request arrival, the request dispatch loop
> keeps on repeating to dispatch the arriving requests very long time,
> more than a minute. Since the loop runs as a workqueue worker task, the
> very long loop duration triggers workqueue watchdog timeout and BUG [1].
> 
> To avoid the very long loop duration, break the loop periodically. When
> opportunity to dispatch requests still exists, check need_resched(). If
> need_resched() returns true, the dispatch loop already consumed its time
> slice, then reschedule the dispatch work and break the loop. With heavy
> IO load, need_resched() does not return true for 20~30 seconds. To cover
> such case, check time spent in the dispatch loop with jiffies. If more
> than 1 second is spent, reschedule the dispatch work and break the loop.
> 
> [1]
> 
> [  609.691437] BUG: workqueue lockup - pool cpus=10 node=1 flags=0x0 nice=-20 stuck for 35s!
> [  609.701820] Showing busy workqueues and worker pools:
> [  609.707915] workqueue events: flags=0x0
> [  609.712615]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
> [  609.712626]     pending: drm_fb_helper_damage_work [drm_kms_helper]
> [  609.712687] workqueue events_freezable: flags=0x4
> [  609.732943]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
> [  609.732952]     pending: pci_pme_list_scan
> [  609.732968] workqueue events_power_efficient: flags=0x80
> [  609.751947]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
> [  609.751955]     pending: neigh_managed_work
> [  609.752018] workqueue kblockd: flags=0x18
> [  609.769480]   pwq 21: cpus=10 node=1 flags=0x0 nice=-20 active=3/256 refcnt=4
> [  609.769488]     in-flight: 1020:blk_mq_run_work_fn
> [  609.769498]     pending: blk_mq_timeout_work, blk_mq_run_work_fn
> [  609.769744] pool 21: cpus=10 node=1 flags=0x0 nice=-20 hung=35s workers=2 idle: 67
> [  639.899730] BUG: workqueue lockup - pool cpus=10 node=1 flags=0x0 nice=-20 stuck for 66s!
> [  639.909513] Showing busy workqueues and worker pools:
> [  639.915404] workqueue events: flags=0x0
> [  639.920197]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
> [  639.920215]     pending: drm_fb_helper_damage_work [drm_kms_helper]
> [  639.920365] workqueue kblockd: flags=0x18
> [  639.939932]   pwq 21: cpus=10 node=1 flags=0x0 nice=-20 active=3/256 refcnt=4
> [  639.939942]     in-flight: 1020:blk_mq_run_work_fn
> [  639.939955]     pending: blk_mq_timeout_work, blk_mq_run_work_fn
> [  639.940212] pool 21: cpus=10 node=1 flags=0x0 nice=-20 hung=66s workers=2 idle: 67
> 
> Fixes: 6e6fcbc27e778 ("blk-mq: support batching dispatch in case of io")
> Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
> Cc: stable@vger.kernel.org # v5.10+
> Link: https://lore.kernel.org/linux-block/20220310091649.zypaem5lkyfadymg@shindev/
> Reviewed-by: Ming Lei <ming.lei@redhat.com>
> Tested-by: Ming Lei <ming.lei@redhat.com>
> ---
>  block/blk-mq-sched.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> index 55488ba97823..64941615befc 100644
> --- a/block/blk-mq-sched.c
> +++ b/block/blk-mq-sched.c
> @@ -181,9 +181,15 @@ static int __blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
>  static int blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
>  {
>  	int ret;
> +	unsigned long end = jiffies + HZ;
>  
>  	do {
>  		ret = __blk_mq_do_dispatch_sched(hctx);
> +		if (ret == 1 &&
> +		    (need_resched() || time_is_after_jiffies(end))) {
> +			blk_mq_delay_run_hw_queue(hctx, 0);
> +			break;

Minor nit: you could "return 1;" directly here.

> +		}
>  	} while (ret == 1);
>  
>  	return ret;

Otherwise, looks good.

Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>

Jens Axboe March 16, 2022, 12:20 p.m. UTC | #2

On 3/16/22 12:11 AM, Shin'ichiro Kawasaki wrote:
> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> index 55488ba97823..64941615befc 100644
> --- a/block/blk-mq-sched.c
> +++ b/block/blk-mq-sched.c
> @@ -181,9 +181,15 @@ static int __blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
>  static int blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
>  {
>  	int ret;
> +	unsigned long end = jiffies + HZ;
>  
>  	do {
>  		ret = __blk_mq_do_dispatch_sched(hctx);
> +		if (ret == 1 &&
> +		    (need_resched() || time_is_after_jiffies(end))) {
> +			blk_mq_delay_run_hw_queue(hctx, 0);
> +			break;
> +		}
>  	} while (ret == 1);
>  
>  	return ret;

I think it'd look cleaner as:

static int blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
{
	unsigned long end = jiffies + HZ;
	int ret;

	do {
		ret = __blk_mq_do_dispatch_sched(hctx);
		if (ret != 1)
			break;
		if (need_resched() || time_is_after_jiffies(end)) {
			blk_mq_delay_run_hw_queue(hctx, 0);
			break;
		}
	} while (1);

	return ret;
}

Shin'ichiro Kawasaki March 17, 2022, 2:56 a.m. UTC | #3

On Mar 16, 2022 / 15:11, Shin'ichiro Kawasaki wrote:
(snip)
> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> index 55488ba97823..64941615befc 100644
> --- a/block/blk-mq-sched.c
> +++ b/block/blk-mq-sched.c
> @@ -181,9 +181,15 @@ static int __blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
>  static int blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
>  {
>  	int ret;
> +	unsigned long end = jiffies + HZ;
>  
>  	do {
>  		ret = __blk_mq_do_dispatch_sched(hctx);
> +		if (ret == 1 &&
> +		    (need_resched() || time_is_after_jiffies(end))) {

I've noticed that I made a horrible mistake. The time_is_after_jiffies(end) in
the line above is wrong. It should be time_is_before_jiffies(end). My bad.
Before I post v2 with this fix, I will ensure the fix by checking call frequency
of blk_mq_delay_run_hw_queue using kprobe.

> +			blk_mq_delay_run_hw_queue(hctx, 0);
> +			break;
> +		}
>  	} while (ret == 1);
>  
>  	return ret;
> -- 
> 2.34.1
>

Shin'ichiro Kawasaki March 17, 2022, 2:58 a.m. UTC | #4

On Mar 16, 2022 / 06:20, Jens Axboe wrote:
> On 3/16/22 12:11 AM, Shin'ichiro Kawasaki wrote:
> > diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> > index 55488ba97823..64941615befc 100644
> > --- a/block/blk-mq-sched.c
> > +++ b/block/blk-mq-sched.c
> > @@ -181,9 +181,15 @@ static int __blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
> >  static int blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
> >  {
> >  	int ret;
> > +	unsigned long end = jiffies + HZ;
> >  
> >  	do {
> >  		ret = __blk_mq_do_dispatch_sched(hctx);
> > +		if (ret == 1 &&
> > +		    (need_resched() || time_is_after_jiffies(end))) {
> > +			blk_mq_delay_run_hw_queue(hctx, 0);
> > +			break;
> > +		}
> >  	} while (ret == 1);
> >  
> >  	return ret;
> 
> I think it'd look cleaner as:
> 
> static int blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
> {
> 	unsigned long end = jiffies + HZ;
> 	int ret;
> 
> 	do {
> 		ret = __blk_mq_do_dispatch_sched(hctx);
> 		if (ret != 1)
> 			break;
> 		if (need_resched() || time_is_after_jiffies(end)) {
> 			blk_mq_delay_run_hw_queue(hctx, 0);
> 			break;
> 		}
> 	} while (1);
> 
> 	return ret;
> }

Thank you for this suggestion. It looks nice since avoids duplicated 'ret == 1'.
Will reflect in v2.

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 55488ba97823..64941615befc 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -181,9 +181,15 @@  static int __blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
 static int blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
 {
 	int ret;
+	unsigned long end = jiffies + HZ;
 
 	do {
 		ret = __blk_mq_do_dispatch_sched(hctx);
+		if (ret == 1 &&
+		    (need_resched() || time_is_after_jiffies(end))) {
+			blk_mq_delay_run_hw_queue(hctx, 0);
+			break;
+		}
 	} while (ret == 1);
 
 	return ret;

block: limit request dispatch loop duration

Commit Message

Comments

Patch