
[bug,report] worker watchdog timeout in dispatch loop for null_blk

Message ID 20220310091649.zypaem5lkyfadymg@shindev (mailing list archive)
State New, archived
Series [bug,report] worker watchdog timeout in dispatch loop for null_blk

Commit Message

Shinichiro Kawasaki March 10, 2022, 9:16 a.m. UTC
This issue does not look critical, but let me share it to ask for comments on a fix.

When an fio command with 40 jobs [1] is run against a null_blk device with memory
backing and the mq-deadline scheduler, the kernel reports a BUG message [2]. The
workqueue watchdog reports that the kblockd work blk_mq_run_work_fn keeps running
for more than 30 seconds and other work items cannot run. The 40 fio jobs keep
creating many read requests for a single null_blk device, so every time the
mq_run task calls __blk_mq_do_dispatch_sched(), it returns ret == 1, which
means at least one request was dispatched. Hence, the while loop in
blk_mq_do_dispatch_sched() does not break.

static int blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
{
        int ret;

        do {
               ret = __blk_mq_do_dispatch_sched(hctx);
        } while (ret == 1);

        return ret;
}
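
For context, the loop above runs in the kblockd worker, roughly via the call chain
below (paraphrased from my reading of the v5.16/v5.17 code; exact function names may
differ slightly between versions), which is why the watchdog blames blk_mq_run_work_fn:

/*
 * kblockd worker
 *   blk_mq_run_work_fn()
 *     __blk_mq_run_hw_queue()           - takes the rcu/srcu read lock
 *       blk_mq_sched_dispatch_requests()
 *         blk_mq_do_dispatch_sched()    - the loop shown above
 */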

The BUG message was observed when I ran blktests block/005 under various
conditions on a system with 40 CPUs. It was observed with kernel versions
v5.16-rc1 through v5.17-rc7. The trigger commit was 0a593fbbc245 ("null_blk:
poll queue support"). This commit added the blk_mq_ops.map_queues callback. I
guess it changed the dispatch behavior for null_blk devices and triggered the
BUG message.

I'm not sure whether we really need to fix this issue. It does not seem to be a
real-world problem since it is observed only with null_blk. Real block devices have
slower I/O operations, so the dispatch should stop sooner when the hardware
queue gets full. Also, 40 jobs against a single device is not a realistic workload.

Having said that, it does not feel right that other work items are left pending during
dispatch for null_blk devices. To avoid the BUG message, I can think of two
fix approaches. The first is to break the while loop in blk_mq_do_dispatch_sched()
using a loop counter [3] (or a jiffies timeout check). But it may not be good to
add such a check in the loop, assuming it is a hot path. The other fix idea is
to count the number of requests within null_blk so that null_queue_rq() occasionally
returns BLK_STS_RESOURCE [4]. This gets the request requeued and lets the
dispatch loop break. This approach looks better to me since it touches
only null_blk.
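
To illustrate the second idea, here is a rough sketch (this is not the actual
patch [4]; the dispatch_cnt field and the NULL_DISPATCH_LIMIT threshold are
made up for this example):

/*
 * Sketch only: let null_queue_rq() return BLK_STS_RESOURCE once in a
 * while so that the request gets requeued and the dispatch loop can
 * break. 'dispatch_cnt' would be a new counter added to struct
 * nullb_queue, and NULL_DISPATCH_LIMIT is an arbitrary threshold.
 */
#define NULL_DISPATCH_LIMIT	0x10000

static blk_status_t null_queue_rq(struct blk_mq_hw_ctx *hctx,
				  const struct blk_mq_queue_data *bd)
{
	struct nullb_queue *nq = hctx->driver_data;

	if (++nq->dispatch_cnt >= NULL_DISPATCH_LIMIT) {
		nq->dispatch_cnt = 0;
		/* force a requeue so blk_mq_do_dispatch_sched() can break */
		return BLK_STS_RESOURCE;
	}

	/* ... the rest of the existing null_queue_rq() handling ... */
}

blk-mq should then requeue the request and re-run the hardware queue later, so
no I/O is lost.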

Comments on the solution will be appreciated. Is there a better fix idea? For now
I'm inclined toward the fix in null_blk.


[1]

# fio --bs=4k --rw=randread --norandommap --numjobs=40 --name=reads --direct=1 \
    --filename=/dev/nullb0 --size=1g --group_reporting

[2]

[  609.691437] BUG: workqueue lockup - pool cpus=10 node=1 flags=0x0 nice=-20 stuck for 35s!
[  609.701820] Showing busy workqueues and worker pools:
[  609.707915] workqueue events: flags=0x0
[  609.712615]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[  609.712626]     pending: drm_fb_helper_damage_work [drm_kms_helper]
[  609.712687] workqueue events_freezable: flags=0x4
[  609.732943]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[  609.732952]     pending: pci_pme_list_scan
[  609.732968] workqueue events_power_efficient: flags=0x80
[  609.751947]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[  609.751955]     pending: neigh_managed_work
[  609.752018] workqueue kblockd: flags=0x18
[  609.769480]   pwq 21: cpus=10 node=1 flags=0x0 nice=-20 active=3/256 refcnt=4
[  609.769488]     in-flight: 1020:blk_mq_run_work_fn
[  609.769498]     pending: blk_mq_timeout_work, blk_mq_run_work_fn
[  609.769744] pool 21: cpus=10 node=1 flags=0x0 nice=-20 hung=35s workers=2 idle: 67
[  639.899730] BUG: workqueue lockup - pool cpus=10 node=1 flags=0x0 nice=-20 stuck for 66s!
[  639.909513] Showing busy workqueues and worker pools:
[  639.915404] workqueue events: flags=0x0
[  639.920197]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[  639.920215]     pending: drm_fb_helper_damage_work [drm_kms_helper]
[  639.920365] workqueue kblockd: flags=0x18
[  639.939932]   pwq 21: cpus=10 node=1 flags=0x0 nice=-20 active=3/256 refcnt=4
[  639.939942]     in-flight: 1020:blk_mq_run_work_fn
[  639.939955]     pending: blk_mq_timeout_work, blk_mq_run_work_fn
[  639.940212] pool 21: cpus=10 node=1 flags=0x0 nice=-20 hung=66s workers=2 idle: 67

[3]

Comments

Ming Lei March 10, 2022, 10 a.m. UTC | #1
On Thu, Mar 10, 2022 at 09:16:50AM +0000, Shinichiro Kawasaki wrote:
> This issue does not look critical, but let me share it to ask comments for fix.
> 
> When fio command with 40 jobs [1] is run for a null_blk device with memory
> backing and mq-deadline scheduler, kernel reports a BUG message [2]. The
> workqueue watchdog reports that kblockd blk_mq_run_work_fn keeps on running
> more than 30 seconds and other work can not run. The 40 fio jobs keep on
> creating many read requests to a single null_blk device, then the every time
> the mq_run task calls __blk_mq_do_dispatch_sched(), it returns ret == 1 which
> means more than one request was dispatched. Hence, the while loop in
> blk_mq_do_dispatch_sched() does not break.
> 
> static int blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
> {
>         int ret;
> 
>         do {
>                ret = __blk_mq_do_dispatch_sched(hctx);
>         } while (ret == 1);
> 
>         return ret;
> }
> 
> The BUG message was observed when I ran blktests block/005 with various
> conditions on a system with 40 CPUs. It was observed with kernel version
> v5.16-rc1 through v5.17-rc7. The trigger commit was 0a593fbbc245 ("null_blk:
> poll queue support"). This commit added blk_mq_ops.map_queues callback. I
> guess it changed dispatch behavior for null_blk devices and triggered the
> BUG message.

It is a blk-mq soft lockup issue on the dispatch side, and it shouldn't be related
to 0a593fbbc245.

If queueing requests is faster than dispatching them, the issue will be triggered
sooner or later, and it is especially easy to trigger on a single-queue (SQ) device.
I am sure it can be triggered with scsi_debug, and I have even seen such a report on ahci.

> 
> I'm not so sure if we really need to fix this issue. It does not seem the real
> world problem since it is observed only with null_blk. The real block devices
> have slower IO operation then the dispatch should stop sooner when the hardware
> queue gets full. Also the 40 jobs for single device is not realistic workload.
> 
> Having said that, it does not feel right that other works are pended during
> dispatch for null_blk devices. To avoid the BUG message, I can think of two
> fix approaches. First one is to break the while loop in blk_mq_do_dispatch_sched
> using a loop counter [3] (or jiffies timeout check).

This way could work, but the queue needs to be re-run after a break caused by
hitting the max dispatch count. cond_resched() might be the simplest way to yield,
but it can't be used here because of the rcu/srcu read lock.
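
To illustrate (simplified; in the actual code the rcu or srcu read lock is taken
by the caller of blk_mq_do_dispatch_sched()), the naive variant below would sleep
inside a read-side critical section and is therefore not an option:

	rcu_read_lock();
	do {
		ret = __blk_mq_do_dispatch_sched(hctx);
		cond_resched();	/* sleeping inside rcu_read_lock(): not allowed */
	} while (ret == 1);
	rcu_read_unlock();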



Thanks,
Ming
Shinichiro Kawasaki March 10, 2022, 12:40 p.m. UTC | #2
On Mar 10, 2022 / 18:00, Ming Lei wrote:
> On Thu, Mar 10, 2022 at 09:16:50AM +0000, Shinichiro Kawasaki wrote:
> > This issue does not look critical, but let me share it to ask comments for fix.
> > 
> > When fio command with 40 jobs [1] is run for a null_blk device with memory
> > backing and mq-deadline scheduler, kernel reports a BUG message [2]. The
> > workqueue watchdog reports that kblockd blk_mq_run_work_fn keeps on running
> > more than 30 seconds and other work can not run. The 40 fio jobs keep on
> > creating many read requests to a single null_blk device, then the every time
> > the mq_run task calls __blk_mq_do_dispatch_sched(), it returns ret == 1 which
> > means more than one request was dispatched. Hence, the while loop in
> > blk_mq_do_dispatch_sched() does not break.
> > 
> > static int blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
> > {
> >         int ret;
> > 
> >         do {
> >                ret = __blk_mq_do_dispatch_sched(hctx);
> >         } while (ret == 1);
> > 
> >         return ret;
> > }
> > 
> > The BUG message was observed when I ran blktests block/005 with various
> > conditions on a system with 40 CPUs. It was observed with kernel version
> > v5.16-rc1 through v5.17-rc7. The trigger commit was 0a593fbbc245 ("null_blk:
> > poll queue support"). This commit added blk_mq_ops.map_queues callback. I
> > guess it changed dispatch behavior for null_blk devices and triggered the
> > BUG message.
> 
> It is one blk-mq soft lockup issue in dispatch side, and shouldn't be related
> with 0a593fbbc245.
> 
> If queueing requests is faster than dispatching, the issue will be triggered
> sooner or later, especially easy to trigger in SQ device. I am sure it can
> be triggered on scsi debug, even saw such report on ahci.

Thank you for the comments. Then this is a real problem.

> 
> > 
> > I'm not so sure if we really need to fix this issue. It does not seem the real
> > world problem since it is observed only with null_blk. The real block devices
> > have slower IO operation then the dispatch should stop sooner when the hardware
> > queue gets full. Also the 40 jobs for single device is not realistic workload.
> > 
> > Having said that, it does not feel right that other works are pended during
> > dispatch for null_blk devices. To avoid the BUG message, I can think of two
> > fix approaches. First one is to break the while loop in blk_mq_do_dispatch_sched
> > using a loop counter [3] (or jiffies timeout check).
> 
> This way could work, but the queue need to be re-run after breaking
> caused by max dispatch number. cond_resched() might be the simplest way,
> but it can't be used here because of rcu/srcu read lock.

As far as I understand, blk_mq_run_work_fn() should return after the loop break
to yield the worker to other work items. How about calling
blk_mq_delay_run_hw_queue() at the loop break? Does that re-run the dispatch?


diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 55488ba978232..faa29448a72a0 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -178,13 +178,19 @@ static int __blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
 	return !!dispatched;
 }
 
+#define MQ_DISPATCH_MAX 0x10000
+
 static int blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
 {
 	int ret;
+	unsigned int count = MQ_DISPATCH_MAX;
 
 	do {
 		ret = __blk_mq_do_dispatch_sched(hctx);
-	} while (ret == 1);
+	} while (ret == 1 && count--);
+
+	if (ret == 1 && !count)
+		blk_mq_delay_run_hw_queue(hctx, 0);
 
 	return ret;
 }
Jens Axboe March 10, 2022, 12:47 p.m. UTC | #3
On 3/10/22 5:40 AM, Shinichiro Kawasaki wrote:
> On Mar 10, 2022 / 18:00, Ming Lei wrote:
>> On Thu, Mar 10, 2022 at 09:16:50AM +0000, Shinichiro Kawasaki wrote:
>>> This issue does not look critical, but let me share it to ask comments for fix.
>>>
>>> When fio command with 40 jobs [1] is run for a null_blk device with memory
>>> backing and mq-deadline scheduler, kernel reports a BUG message [2]. The
>>> workqueue watchdog reports that kblockd blk_mq_run_work_fn keeps on running
>>> more than 30 seconds and other work can not run. The 40 fio jobs keep on
>>> creating many read requests to a single null_blk device, then the every time
>>> the mq_run task calls __blk_mq_do_dispatch_sched(), it returns ret == 1 which
>>> means more than one request was dispatched. Hence, the while loop in
>>> blk_mq_do_dispatch_sched() does not break.
>>>
>>> static int blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
>>> {
>>>         int ret;
>>>
>>>         do {
>>>                ret = __blk_mq_do_dispatch_sched(hctx);
>>>         } while (ret == 1);
>>>
>>>         return ret;
>>> }
>>>
>>> The BUG message was observed when I ran blktests block/005 with various
>>> conditions on a system with 40 CPUs. It was observed with kernel version
>>> v5.16-rc1 through v5.17-rc7. The trigger commit was 0a593fbbc245 ("null_blk:
>>> poll queue support"). This commit added blk_mq_ops.map_queues callback. I
>>> guess it changed dispatch behavior for null_blk devices and triggered the
>>> BUG message.
>>
>> It is one blk-mq soft lockup issue in dispatch side, and shouldn't be related
>> with 0a593fbbc245.
>>
>> If queueing requests is faster than dispatching, the issue will be triggered
>> sooner or later, especially easy to trigger in SQ device. I am sure it can
>> be triggered on scsi debug, even saw such report on ahci.
> 
> Thank you for the comments. Then this is the real problem.
> 
>>
>>>
>>> I'm not so sure if we really need to fix this issue. It does not seem the real
>>> world problem since it is observed only with null_blk. The real block devices
>>> have slower IO operation then the dispatch should stop sooner when the hardware
>>> queue gets full. Also the 40 jobs for single device is not realistic workload.
>>>
>>> Having said that, it does not feel right that other works are pended during
>>> dispatch for null_blk devices. To avoid the BUG message, I can think of two
>>> fix approaches. First one is to break the while loop in blk_mq_do_dispatch_sched
>>> using a loop counter [3] (or jiffies timeout check).
>>
>> This way could work, but the queue need to be re-run after breaking
>> caused by max dispatch number. cond_resched() might be the simplest way,
>> but it can't be used here because of rcu/srcu read lock.
> 
> As far as I understand, blk_mq_run_work_fn() should return after the loop break
> to yield the worker to other works. How about to call
> blk_mq_delay_run_hw_queue() at the loop break? Does this re-run the dispatch?
> 
> 
> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> index 55488ba978232..faa29448a72a0 100644
> --- a/block/blk-mq-sched.c
> +++ b/block/blk-mq-sched.c
> @@ -178,13 +178,19 @@ static int __blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
>  	return !!dispatched;
>  }
>  
> +#define MQ_DISPATCH_MAX 0x10000
> +
>  static int blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
>  {
>  	int ret;
> +	unsigned int count = MQ_DISPATCH_MAX;
>  
>  	do {
>  		ret = __blk_mq_do_dispatch_sched(hctx);
> -	} while (ret == 1);
> +	} while (ret == 1 && count--);
> +
> +	if (ret == 1 && !count)
> +		blk_mq_delay_run_hw_queue(hctx, 0);
>  
>  	return ret;
>  }

Why not just gate it on needing to reschedule, rather than some random
value?

static int blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
{
	int ret;

	do {
		ret = __blk_mq_do_dispatch_sched(hctx);
	} while (ret == 1 && !need_resched());

	if (ret == 1 && need_resched())
		blk_mq_delay_run_hw_queue(hctx, 0);

	return ret;
}

or something like that.
Shinichiro Kawasaki March 11, 2022, 6:24 a.m. UTC | #4
On Mar 10, 2022 / 05:47, Jens Axboe wrote:
> On 3/10/22 5:40 AM, Shinichiro Kawasaki wrote:
> > On Mar 10, 2022 / 18:00, Ming Lei wrote:
> >> On Thu, Mar 10, 2022 at 09:16:50AM +0000, Shinichiro Kawasaki wrote:
> >>> This issue does not look critical, but let me share it to ask comments for fix.
> >>>
> >>> When fio command with 40 jobs [1] is run for a null_blk device with memory
> >>> backing and mq-deadline scheduler, kernel reports a BUG message [2]. The
> >>> workqueue watchdog reports that kblockd blk_mq_run_work_fn keeps on running
> >>> more than 30 seconds and other work can not run. The 40 fio jobs keep on
> >>> creating many read requests to a single null_blk device, then the every time
> >>> the mq_run task calls __blk_mq_do_dispatch_sched(), it returns ret == 1 which
> >>> means more than one request was dispatched. Hence, the while loop in
> >>> blk_mq_do_dispatch_sched() does not break.
> >>>
> >>> static int blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
> >>> {
> >>>         int ret;
> >>>
> >>>         do {
> >>>                ret = __blk_mq_do_dispatch_sched(hctx);
> >>>         } while (ret == 1);
> >>>
> >>>         return ret;
> >>> }
> >>>
> >>> The BUG message was observed when I ran blktests block/005 with various
> >>> conditions on a system with 40 CPUs. It was observed with kernel version
> >>> v5.16-rc1 through v5.17-rc7. The trigger commit was 0a593fbbc245 ("null_blk:
> >>> poll queue support"). This commit added blk_mq_ops.map_queues callback. I
> >>> guess it changed dispatch behavior for null_blk devices and triggered the
> >>> BUG message.
> >>
> >> It is one blk-mq soft lockup issue in dispatch side, and shouldn't be related
> >> with 0a593fbbc245.
> >>
> >> If queueing requests is faster than dispatching, the issue will be triggered
> >> sooner or later, especially easy to trigger in SQ device. I am sure it can
> >> be triggered on scsi debug, even saw such report on ahci.
> > 
> > Thank you for the comments. Then this is the real problem.
> > 
> >>
> >>>
> >>> I'm not so sure if we really need to fix this issue. It does not seem the real
> >>> world problem since it is observed only with null_blk. The real block devices
> >>> have slower IO operation then the dispatch should stop sooner when the hardware
> >>> queue gets full. Also the 40 jobs for single device is not realistic workload.
> >>>
> >>> Having said that, it does not feel right that other works are pended during
> >>> dispatch for null_blk devices. To avoid the BUG message, I can think of two
> >>> fix approaches. First one is to break the while loop in blk_mq_do_dispatch_sched
> >>> using a loop counter [3] (or jiffies timeout check).
> >>
> >> This way could work, but the queue need to be re-run after breaking
> >> caused by max dispatch number. cond_resched() might be the simplest way,
> >> but it can't be used here because of rcu/srcu read lock.
> > 
> > As far as I understand, blk_mq_run_work_fn() should return after the loop break
> > to yield the worker to other works. How about to call
> > blk_mq_delay_run_hw_queue() at the loop break? Does this re-run the dispatch?
> > 
> > 
> > diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> > index 55488ba978232..faa29448a72a0 100644
> > --- a/block/blk-mq-sched.c
> > +++ b/block/blk-mq-sched.c
> > @@ -178,13 +178,19 @@ static int __blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
> >  	return !!dispatched;
> >  }
> >  
> > +#define MQ_DISPATCH_MAX 0x10000
> > +
> >  static int blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
> >  {
> >  	int ret;
> > +	unsigned int count = MQ_DISPATCH_MAX;
> >  
> >  	do {
> >  		ret = __blk_mq_do_dispatch_sched(hctx);
> > -	} while (ret == 1);
> > +	} while (ret == 1 && count--);
> > +
> > +	if (ret == 1 && !count)
> > +		blk_mq_delay_run_hw_queue(hctx, 0);
> >  
> >  	return ret;
> >  }
> 
> Why not just gate it on needing to reschedule, rather than some random
> value?
> 
> static int blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
> {
> 	int ret;
> 
> 	do {
> 		ret = __blk_mq_do_dispatch_sched(hctx);
> 	} while (ret == 1 && !need_resched());
> 
> 	if (ret == 1 && need_resched())
> 		blk_mq_delay_run_hw_queue(hctx, 0);
> 
> 	return ret;
> }
> 
> or something like that.

Jens, thanks for the idea, but the need_resched() check does not seem to work here.
I tried the code above, but the BUG message is still observed. My guess is that
in the call stack below, every __blk_mq_do_dispatch_sched() call results in a
might_sleep_if() call, so need_resched() probably does not work as expected.

__blk_mq_do_dispatch_sched
  blk_mq_dispatch_rq_list
    q->mq_ops->queue_rq
      null_queue_rq
        might_sleep_if

Now I'm trying to find a mechanism similar to need_resched() that avoids a magic number.
So far I haven't found a good idea.
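
For reference, my rough understanding of the macros involved (paraphrased; the
exact definitions depend on kernel version and preemption config) is:

/*
 * Roughly, with CONFIG_PREEMPT_VOLUNTARY:
 *
 *   might_sleep_if(cond)  ->  if (cond) might_sleep();
 *   might_sleep()         ->  __might_sleep(...); might_resched();
 *   might_resched()       ->  cond_resched();
 *
 * So when the condition is true (e.g. BLK_MQ_F_BLOCKING is set), each
 * pass through ->queue_rq() may voluntarily reschedule and clear
 * TIF_NEED_RESCHED before blk_mq_do_dispatch_sched() gets to check it.
 */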
Ming Lei March 11, 2022, 9:51 a.m. UTC | #5
On Fri, Mar 11, 2022 at 06:24:41AM +0000, Shinichiro Kawasaki wrote:
> On Mar 10, 2022 / 05:47, Jens Axboe wrote:
> > On 3/10/22 5:40 AM, Shinichiro Kawasaki wrote:
> > > On Mar 10, 2022 / 18:00, Ming Lei wrote:
> > >> On Thu, Mar 10, 2022 at 09:16:50AM +0000, Shinichiro Kawasaki wrote:
> > >>> This issue does not look critical, but let me share it to ask comments for fix.
> > >>>
> > >>> When fio command with 40 jobs [1] is run for a null_blk device with memory
> > >>> backing and mq-deadline scheduler, kernel reports a BUG message [2]. The
> > >>> workqueue watchdog reports that kblockd blk_mq_run_work_fn keeps on running
> > >>> more than 30 seconds and other work can not run. The 40 fio jobs keep on
> > >>> creating many read requests to a single null_blk device, then the every time
> > >>> the mq_run task calls __blk_mq_do_dispatch_sched(), it returns ret == 1 which
> > >>> means more than one request was dispatched. Hence, the while loop in
> > >>> blk_mq_do_dispatch_sched() does not break.
> > >>>
> > >>> static int blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
> > >>> {
> > >>>         int ret;
> > >>>
> > >>>         do {
> > >>>                ret = __blk_mq_do_dispatch_sched(hctx);
> > >>>         } while (ret == 1);
> > >>>
> > >>>         return ret;
> > >>> }
> > >>>
> > >>> The BUG message was observed when I ran blktests block/005 with various
> > >>> conditions on a system with 40 CPUs. It was observed with kernel version
> > >>> v5.16-rc1 through v5.17-rc7. The trigger commit was 0a593fbbc245 ("null_blk:
> > >>> poll queue support"). This commit added blk_mq_ops.map_queues callback. I
> > >>> guess it changed dispatch behavior for null_blk devices and triggered the
> > >>> BUG message.
> > >>
> > >> It is one blk-mq soft lockup issue in dispatch side, and shouldn't be related
> > >> with 0a593fbbc245.
> > >>
> > >> If queueing requests is faster than dispatching, the issue will be triggered
> > >> sooner or later, especially easy to trigger in SQ device. I am sure it can
> > >> be triggered on scsi debug, even saw such report on ahci.
> > > 
> > > Thank you for the comments. Then this is the real problem.
> > > 
> > >>
> > >>>
> > >>> I'm not so sure if we really need to fix this issue. It does not seem the real
> > >>> world problem since it is observed only with null_blk. The real block devices
> > >>> have slower IO operation then the dispatch should stop sooner when the hardware
> > >>> queue gets full. Also the 40 jobs for single device is not realistic workload.
> > >>>
> > >>> Having said that, it does not feel right that other works are pended during
> > >>> dispatch for null_blk devices. To avoid the BUG message, I can think of two
> > >>> fix approaches. First one is to break the while loop in blk_mq_do_dispatch_sched
> > >>> using a loop counter [3] (or jiffies timeout check).
> > >>
> > >> This way could work, but the queue need to be re-run after breaking
> > >> caused by max dispatch number. cond_resched() might be the simplest way,
> > >> but it can't be used here because of rcu/srcu read lock.
> > > 
> > > As far as I understand, blk_mq_run_work_fn() should return after the loop break
> > > to yield the worker to other works. How about to call
> > > blk_mq_delay_run_hw_queue() at the loop break? Does this re-run the dispatch?
> > > 
> > > 
> > > diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> > > index 55488ba978232..faa29448a72a0 100644
> > > --- a/block/blk-mq-sched.c
> > > +++ b/block/blk-mq-sched.c
> > > @@ -178,13 +178,19 @@ static int __blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
> > >  	return !!dispatched;
> > >  }
> > >  
> > > +#define MQ_DISPATCH_MAX 0x10000
> > > +
> > >  static int blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
> > >  {
> > >  	int ret;
> > > +	unsigned int count = MQ_DISPATCH_MAX;
> > >  
> > >  	do {
> > >  		ret = __blk_mq_do_dispatch_sched(hctx);
> > > -	} while (ret == 1);
> > > +	} while (ret == 1 && count--);
> > > +
> > > +	if (ret == 1 && !count)
> > > +		blk_mq_delay_run_hw_queue(hctx, 0);
> > >  
> > >  	return ret;
> > >  }
> > 
> > Why not just gate it on needing to reschedule, rather than some random
> > value?
> > 
> > static int blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
> > {
> > 	int ret;
> > 
> > 	do {
> > 		ret = __blk_mq_do_dispatch_sched(hctx);
> > 	} while (ret == 1 && !need_resched());
> > 
> > 	if (ret == 1 && need_resched())
> > 		blk_mq_delay_run_hw_queue(hctx, 0);
> > 
> > 	return ret;
> > }
> > 
> > or something like that.
> 
> Jens, thanks for the idea, but need_resched() check does not look working here.
> I tried the code above but still the BUG message is observed. My guess is that
> in the call stack below, every __blk_mq_do_dispatch_sched() call results in
> might_sleep_if() call, then need_resched() does not work as expected, probably.
> 
> __blk_mq_do_dispatch_sched
>   blk_mq_dispatch_rq_list
>     q->mq_ops->queue_rq
>       null_queue_rq
>         might_sleep_if
> 
> Now I'm trying to find similar way as need_resched() to avoid the random number.
> So far I haven't found good idea yet.

Jens' patch using need_resched() looks like it improves the situation; also, the
scsi_debug case won't set BLOCKING:

1) Without the patch, it is easy to trigger the lockup with the
following test.

- modprobe scsi_debug virtual_gb=128 delay=0 dev_size_mb=2048
- fio --bs=512k --ioengine=sync --iodepth=128 --numjobs=4 --rw=randrw \
	--name=sdc-sync-randrw-512k --filename=/dev/sdc --direct=1 --size=60G --runtime=120

#sdc is the created scsi_debug disk

2) With the patch, the lockup can still be triggered, just not as easily as
before.

3) Tracing need_resched() from kretprobe:__blk_mq_do_dispatch_sched() in
kblockd workqueue context, I observed that resched gets set only 20~30 times
while running the above script (120 sec), so maybe need_resched() isn't getting
set during heavy IO dispatching.


Thanks,
Ming
Shinichiro Kawasaki March 14, 2022, 5:24 a.m. UTC | #6
On Mar 11, 2022 / 17:51, Ming Lei wrote:
> On Fri, Mar 11, 2022 at 06:24:41AM +0000, Shinichiro Kawasaki wrote:
> > On Mar 10, 2022 / 05:47, Jens Axboe wrote:
> > > On 3/10/22 5:40 AM, Shinichiro Kawasaki wrote:
> > > > On Mar 10, 2022 / 18:00, Ming Lei wrote:
> > > >> On Thu, Mar 10, 2022 at 09:16:50AM +0000, Shinichiro Kawasaki wrote:
> > > >>> This issue does not look critical, but let me share it to ask comments for fix.
> > > >>>
> > > >>> When fio command with 40 jobs [1] is run for a null_blk device with memory
> > > >>> backing and mq-deadline scheduler, kernel reports a BUG message [2]. The
> > > >>> workqueue watchdog reports that kblockd blk_mq_run_work_fn keeps on running
> > > >>> more than 30 seconds and other work can not run. The 40 fio jobs keep on
> > > >>> creating many read requests to a single null_blk device, then the every time
> > > >>> the mq_run task calls __blk_mq_do_dispatch_sched(), it returns ret == 1 which
> > > >>> means more than one request was dispatched. Hence, the while loop in
> > > >>> blk_mq_do_dispatch_sched() does not break.
> > > >>>
> > > >>> static int blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
> > > >>> {
> > > >>>         int ret;
> > > >>>
> > > >>>         do {
> > > >>>                ret = __blk_mq_do_dispatch_sched(hctx);
> > > >>>         } while (ret == 1);
> > > >>>
> > > >>>         return ret;
> > > >>> }
> > > >>>
> > > >>> The BUG message was observed when I ran blktests block/005 with various
> > > >>> conditions on a system with 40 CPUs. It was observed with kernel version
> > > >>> v5.16-rc1 through v5.17-rc7. The trigger commit was 0a593fbbc245 ("null_blk:
> > > >>> poll queue support"). This commit added blk_mq_ops.map_queues callback. I
> > > >>> guess it changed dispatch behavior for null_blk devices and triggered the
> > > >>> BUG message.
> > > >>
> > > >> It is one blk-mq soft lockup issue in dispatch side, and shouldn't be related
> > > >> with 0a593fbbc245.
> > > >>
> > > >> If queueing requests is faster than dispatching, the issue will be triggered
> > > >> sooner or later, especially easy to trigger in SQ device. I am sure it can
> > > >> be triggered on scsi debug, even saw such report on ahci.
> > > > 
> > > > Thank you for the comments. Then this is the real problem.
> > > > 
> > > >>
> > > >>>
> > > >>> I'm not so sure if we really need to fix this issue. It does not seem the real
> > > >>> world problem since it is observed only with null_blk. The real block devices
> > > >>> have slower IO operation then the dispatch should stop sooner when the hardware
> > > >>> queue gets full. Also the 40 jobs for single device is not realistic workload.
> > > >>>
> > > >>> Having said that, it does not feel right that other works are pended during
> > > >>> dispatch for null_blk devices. To avoid the BUG message, I can think of two
> > > >>> fix approaches. First one is to break the while loop in blk_mq_do_dispatch_sched
> > > >>> using a loop counter [3] (or jiffies timeout check).
> > > >>
> > > >> This way could work, but the queue need to be re-run after breaking
> > > >> caused by max dispatch number. cond_resched() might be the simplest way,
> > > >> but it can't be used here because of rcu/srcu read lock.
> > > > 
> > > > As far as I understand, blk_mq_run_work_fn() should return after the loop break
> > > > to yield the worker to other works. How about to call
> > > > blk_mq_delay_run_hw_queue() at the loop break? Does this re-run the dispatch?
> > > > 
> > > > 
> > > > diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> > > > index 55488ba978232..faa29448a72a0 100644
> > > > --- a/block/blk-mq-sched.c
> > > > +++ b/block/blk-mq-sched.c
> > > > @@ -178,13 +178,19 @@ static int __blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
> > > >  	return !!dispatched;
> > > >  }
> > > >  
> > > > +#define MQ_DISPATCH_MAX 0x10000
> > > > +
> > > >  static int blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
> > > >  {
> > > >  	int ret;
> > > > +	unsigned int count = MQ_DISPATCH_MAX;
> > > >  
> > > >  	do {
> > > >  		ret = __blk_mq_do_dispatch_sched(hctx);
> > > > -	} while (ret == 1);
> > > > +	} while (ret == 1 && count--);
> > > > +
> > > > +	if (ret == 1 && !count)
> > > > +		blk_mq_delay_run_hw_queue(hctx, 0);
> > > >  
> > > >  	return ret;
> > > >  }
> > > 
> > > Why not just gate it on needing to reschedule, rather than some random
> > > value?
> > > 
> > > static int blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
> > > {
> > > 	int ret;
> > > 
> > > 	do {
> > > 		ret = __blk_mq_do_dispatch_sched(hctx);
> > > 	} while (ret == 1 && !need_resched());
> > > 
> > > 	if (ret == 1 && need_resched())
> > > 		blk_mq_delay_run_hw_queue(hctx, 0);
> > > 
> > > 	return ret;
> > > }
> > > 
> > > or something like that.
> > 
> > Jens, thanks for the idea, but need_resched() check does not look working here.
> > I tried the code above but still the BUG message is observed. My guess is that
> > in the call stack below, every __blk_mq_do_dispatch_sched() call results in
> > might_sleep_if() call, then need_resched() does not work as expected, probably.
> > 
> > __blk_mq_do_dispatch_sched
> >   blk_mq_dispatch_rq_list
> >     q->mq_ops->queue_rq
> >       null_queue_rq
> >         might_sleep_if
> > 
> > Now I'm trying to find similar way as need_resched() to avoid the random number.
> > So far I haven't found good idea yet.
> 
> Jens patch using need_resched() looks improving the situation, also the
> scsi_debug case won't set BLOCKING:
> 
> 1) without the patch, it can be easy to trigger lockup with the
> following test.
> 
> - modprobe scsi_debug virtual_gb=128 delay=0 dev_size_mb=2048
> - fio --bs=512k --ioengine=sync --iodepth=128 --numjobs=4 --rw=randrw \
> 	--name=sdc-sync-randrw-512k --filename=/dev/sdc --direct=1 --size=60G --runtime=120
> 
> #sdc is the created scsi_debug disk

Thanks. I tried the workload above and observed the lockup BUG message on my
system. So this confirms that the problem happens with both BLOCKING and
non-BLOCKING drivers.

Regarding the solution, I cannot think of a good one. I tried removing the
WQ_HIGHPRI flag from kblockd_workqueue, but it did not appear to affect the
need_resched() behavior. I walked through the workqueue API but was not able
to find anything useful.

As far as I understand, it is assumed and expected that each work item completes
within a reasonable time. Then blk_mq_run_work_fn() must stop within a reasonable
time by breaking the loop at some point. As loop break conditions other than
need_resched(), I can think of 1) a loop count, 2) the number of requests
dispatched, or 3) the time spent in the loop. All three require a magic number
as the limit... Is there any better way?

If we need to choose one of the three options, I think '3) time spent in the loop'
is the best, since the workqueue watchdog monitors _time_ to detect a lockup
and report the BUG message.
Ming Lei March 14, 2022, 7 a.m. UTC | #7
On Mon, Mar 14, 2022 at 05:24:34AM +0000, Shinichiro Kawasaki wrote:
> On Mar 11, 2022 / 17:51, Ming Lei wrote:
> > On Fri, Mar 11, 2022 at 06:24:41AM +0000, Shinichiro Kawasaki wrote:
> > > On Mar 10, 2022 / 05:47, Jens Axboe wrote:
> > > > On 3/10/22 5:40 AM, Shinichiro Kawasaki wrote:
> > > > > On Mar 10, 2022 / 18:00, Ming Lei wrote:
> > > > >> On Thu, Mar 10, 2022 at 09:16:50AM +0000, Shinichiro Kawasaki wrote:
> > > > >>> This issue does not look critical, but let me share it to ask comments for fix.
> > > > >>>
> > > > >>> When fio command with 40 jobs [1] is run for a null_blk device with memory
> > > > >>> backing and mq-deadline scheduler, kernel reports a BUG message [2]. The
> > > > >>> workqueue watchdog reports that kblockd blk_mq_run_work_fn keeps on running
> > > > >>> more than 30 seconds and other work can not run. The 40 fio jobs keep on
> > > > >>> creating many read requests to a single null_blk device, then the every time
> > > > >>> the mq_run task calls __blk_mq_do_dispatch_sched(), it returns ret == 1 which
> > > > >>> means more than one request was dispatched. Hence, the while loop in
> > > > >>> blk_mq_do_dispatch_sched() does not break.
> > > > >>>
> > > > >>> static int blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
> > > > >>> {
> > > > >>>         int ret;
> > > > >>>
> > > > >>>         do {
> > > > >>>                ret = __blk_mq_do_dispatch_sched(hctx);
> > > > >>>         } while (ret == 1);
> > > > >>>
> > > > >>>         return ret;
> > > > >>> }
> > > > >>>
> > > > >>> The BUG message was observed when I ran blktests block/005 with various
> > > > >>> conditions on a system with 40 CPUs. It was observed with kernel version
> > > > >>> v5.16-rc1 through v5.17-rc7. The trigger commit was 0a593fbbc245 ("null_blk:
> > > > >>> poll queue support"). This commit added blk_mq_ops.map_queues callback. I
> > > > >>> guess it changed dispatch behavior for null_blk devices and triggered the
> > > > >>> BUG message.
> > > > >>
> > > > >> It is one blk-mq soft lockup issue in dispatch side, and shouldn't be related
> > > > >> with 0a593fbbc245.
> > > > >>
> > > > >> If queueing requests is faster than dispatching, the issue will be triggered
> > > > >> sooner or later, especially easy to trigger in SQ device. I am sure it can
> > > > >> be triggered on scsi debug, even saw such report on ahci.
> > > > > 
> > > > > Thank you for the comments. Then this is the real problem.
> > > > > 
> > > > >>
> > > > >>>
> > > > >>> I'm not so sure if we really need to fix this issue. It does not seem the real
> > > > >>> world problem since it is observed only with null_blk. The real block devices
> > > > >>> have slower IO operation then the dispatch should stop sooner when the hardware
> > > > >>> queue gets full. Also the 40 jobs for single device is not realistic workload.
> > > > >>>
> > > > >>> Having said that, it does not feel right that other works are pended during
> > > > >>> dispatch for null_blk devices. To avoid the BUG message, I can think of two
> > > > >>> fix approaches. First one is to break the while loop in blk_mq_do_dispatch_sched
> > > > >>> using a loop counter [3] (or jiffies timeout check).
> > > > >>
> > > > >> This way could work, but the queue need to be re-run after breaking
> > > > >> caused by max dispatch number. cond_resched() might be the simplest way,
> > > > >> but it can't be used here because of rcu/srcu read lock.
> > > > > 
> > > > > As far as I understand, blk_mq_run_work_fn() should return after the loop break
> > > > > to yield the worker to other works. How about to call
> > > > > blk_mq_delay_run_hw_queue() at the loop break? Does this re-run the dispatch?
> > > > > 
> > > > > 
> > > > > diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> > > > > index 55488ba978232..faa29448a72a0 100644
> > > > > --- a/block/blk-mq-sched.c
> > > > > +++ b/block/blk-mq-sched.c
> > > > > @@ -178,13 +178,19 @@ static int __blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
> > > > >  	return !!dispatched;
> > > > >  }
> > > > >  
> > > > > +#define MQ_DISPATCH_MAX 0x10000
> > > > > +
> > > > >  static int blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
> > > > >  {
> > > > >  	int ret;
> > > > > +	unsigned int count = MQ_DISPATCH_MAX;
> > > > >  
> > > > >  	do {
> > > > >  		ret = __blk_mq_do_dispatch_sched(hctx);
> > > > > -	} while (ret == 1);
> > > > > +	} while (ret == 1 && count--);
> > > > > +
> > > > > +	if (ret == 1 && !count)
> > > > > +		blk_mq_delay_run_hw_queue(hctx, 0);
> > > > >  
> > > > >  	return ret;
> > > > >  }
> > > > 
> > > > Why not just gate it on needing to reschedule, rather than some random
> > > > value?
> > > > 
> > > > static int blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
> > > > {
> > > > 	int ret;
> > > > 
> > > > 	do {
> > > > 		ret = __blk_mq_do_dispatch_sched(hctx);
> > > > 	} while (ret == 1 && !need_resched());
> > > > 
> > > > 	if (ret == 1 && need_resched())
> > > > 		blk_mq_delay_run_hw_queue(hctx, 0);
> > > > 
> > > > 	return ret;
> > > > }
> > > > 
> > > > or something like that.
> > > 
> > > Jens, thanks for the idea, but need_resched() check does not look working here.
> > > I tried the code above but still the BUG message is observed. My guess is that
> > > in the call stack below, every __blk_mq_do_dispatch_sched() call results in
> > > might_sleep_if() call, then need_resched() does not work as expected, probably.
> > > 
> > > __blk_mq_do_dispatch_sched
> > >   blk_mq_dispatch_rq_list
> > >     q->mq_ops->queue_rq
> > >       null_queue_rq
> > >         might_sleep_if
> > > 
> > > Now I'm trying to find similar way as need_resched() to avoid the random number.
> > > So far I haven't found good idea yet.
> > 
> > Jens patch using need_resched() looks improving the situation, also the
> > scsi_debug case won't set BLOCKING:
> > 
> > 1) without the patch, it can be easy to trigger lockup with the
> > following test.
> > 
> > - modprobe scsi_debug virtual_gb=128 delay=0 dev_size_mb=2048
> > - fio --bs=512k --ioengine=sync --iodepth=128 --numjobs=4 --rw=randrw \
> > 	--name=sdc-sync-randrw-512k --filename=/dev/sdc --direct=1 --size=60G --runtime=120
> > 
> > #sdc is the created scsi_debug disk
> 
> Thanks. I tried the work load above and observed the lockup BUG message on my
> system. So, I reconfirmed that the problem happens with both BLOCKING and
> non-BLOCKING drivers.
> 
> Regarding the solution, I can not think of any good one. I tried to remove the
> WQ_HIGHPRI flag from kblockd_workqueue, but it did not look affecting
> need_resched() behavior. I walked through workqueue API, but was not able
> to find anything useful.
> 
> As far as I understand, it is assumed and expected the each work item gets
> completed within decent time. Then this blk_mq_run_work_fn must stop within
> decent time by breaking the loop at some point. As the loop break conditions
> other than need_resched(), I can think of 1) loop count, 2) number of requests
> dispatched or 3) time spent in the loop. All of the three require a magic random
> number as the limit... Is there any other better way?
> 
> If we need to choose one of the 3 options, I think '3) time spent in the loop'
> is better than others, since workqueue watchdog monitors _time_ to check lockup
> and report the BUG message.

BTW, I just tried 3), and the lockup issue can't be reproduced any more:

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index e6ad8f761474..b4de5a7ec606 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -181,10 +181,14 @@ static int __blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
 static int blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
 {
        int ret;
+       unsigned long start = jiffies;
 
        do {
                ret = __blk_mq_do_dispatch_sched(hctx);
-       } while (ret == 1);
+       } while (ret == 1 && !need_resched() && (jiffies - start) < HZ);
+
+       if (ret == 1 && (need_resched() || jiffies - start >= HZ))
+                blk_mq_delay_run_hw_queue(hctx, 0);
 
        return ret;
 }



Thanks, 
Ming
Shinichiro Kawasaki March 15, 2022, 5:24 a.m. UTC | #8
On Mar 14, 2022 / 15:00, Ming Lei wrote:
> On Mon, Mar 14, 2022 at 05:24:34AM +0000, Shinichiro Kawasaki wrote:
> > On Mar 11, 2022 / 17:51, Ming Lei wrote:
> > > On Fri, Mar 11, 2022 at 06:24:41AM +0000, Shinichiro Kawasaki wrote:
> > > > On Mar 10, 2022 / 05:47, Jens Axboe wrote:
> > > > > On 3/10/22 5:40 AM, Shinichiro Kawasaki wrote:
> > > > > > On Mar 10, 2022 / 18:00, Ming Lei wrote:
> > > > > >> On Thu, Mar 10, 2022 at 09:16:50AM +0000, Shinichiro Kawasaki wrote:
> > > > > >>> This issue does not look critical, but let me share it to ask comments for fix.
> > > > > >>>
> > > > > >>> When fio command with 40 jobs [1] is run for a null_blk device with memory
> > > > > >>> backing and mq-deadline scheduler, kernel reports a BUG message [2]. The
> > > > > >>> workqueue watchdog reports that kblockd blk_mq_run_work_fn keeps on running
> > > > > >>> more than 30 seconds and other work can not run. The 40 fio jobs keep on
> > > > > >>> creating many read requests to a single null_blk device, then the every time
> > > > > >>> the mq_run task calls __blk_mq_do_dispatch_sched(), it returns ret == 1 which
> > > > > >>> means more than one request was dispatched. Hence, the while loop in
> > > > > >>> blk_mq_do_dispatch_sched() does not break.
> > > > > >>>
> > > > > >>> static int blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
> > > > > >>> {
> > > > > >>>         int ret;
> > > > > >>>
> > > > > >>>         do {
> > > > > >>>                ret = __blk_mq_do_dispatch_sched(hctx);
> > > > > >>>         } while (ret == 1);
> > > > > >>>
> > > > > >>>         return ret;
> > > > > >>> }
> > > > > >>>
> > > > > >>> The BUG message was observed when I ran blktests block/005 with various
> > > > > >>> conditions on a system with 40 CPUs. It was observed with kernel version
> > > > > >>> v5.16-rc1 through v5.17-rc7. The trigger commit was 0a593fbbc245 ("null_blk:
> > > > > >>> poll queue support"). This commit added blk_mq_ops.map_queues callback. I
> > > > > >>> guess it changed dispatch behavior for null_blk devices and triggered the
> > > > > >>> BUG message.
> > > > > >>
> > > > > >> It is one blk-mq soft lockup issue in dispatch side, and shouldn't be related
> > > > > >> with 0a593fbbc245.
> > > > > >>
> > > > > >> If queueing requests is faster than dispatching, the issue will be triggered
> > > > > >> sooner or later, especially easy to trigger in SQ device. I am sure it can
> > > > > >> be triggered on scsi debug, even saw such report on ahci.
> > > > > > 
> > > > > > Thank you for the comments. Then this is the real problem.
> > > > > > 
> > > > > >>
> > > > > >>>
> > > > > >>> I'm not so sure if we really need to fix this issue. It does not seem the real
> > > > > >>> world problem since it is observed only with null_blk. The real block devices
> > > > > >>> have slower IO operation then the dispatch should stop sooner when the hardware
> > > > > >>> queue gets full. Also the 40 jobs for single device is not realistic workload.
> > > > > >>>
> > > > > >>> Having said that, it does not feel right that other works are pended during
> > > > > >>> dispatch for null_blk devices. To avoid the BUG message, I can think of two
> > > > > >>> fix approaches. First one is to break the while loop in blk_mq_do_dispatch_sched
> > > > > >>> using a loop counter [3] (or jiffies timeout check).
> > > > > >>
> > > > > >> This way could work, but the queue need to be re-run after breaking
> > > > > >> caused by max dispatch number. cond_resched() might be the simplest way,
> > > > > >> but it can't be used here because of rcu/srcu read lock.
> > > > > > 
> > > > > > As far as I understand, blk_mq_run_work_fn() should return after the loop break
> > > > > > to yield the worker to other works. How about to call
> > > > > > blk_mq_delay_run_hw_queue() at the loop break? Does this re-run the dispatch?
> > > > > > 
> > > > > > 
> > > > > > diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> > > > > > index 55488ba978232..faa29448a72a0 100644
> > > > > > --- a/block/blk-mq-sched.c
> > > > > > +++ b/block/blk-mq-sched.c
> > > > > > @@ -178,13 +178,19 @@ static int __blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
> > > > > >  	return !!dispatched;
> > > > > >  }
> > > > > >  
> > > > > > +#define MQ_DISPATCH_MAX 0x10000
> > > > > > +
> > > > > >  static int blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
> > > > > >  {
> > > > > >  	int ret;
> > > > > > +	unsigned int count = MQ_DISPATCH_MAX;
> > > > > >  
> > > > > >  	do {
> > > > > >  		ret = __blk_mq_do_dispatch_sched(hctx);
> > > > > > -	} while (ret == 1);
> > > > > > +	} while (ret == 1 && count--);
> > > > > > +
> > > > > > +	if (ret == 1 && !count)
> > > > > > +		blk_mq_delay_run_hw_queue(hctx, 0);
> > > > > >  
> > > > > >  	return ret;
> > > > > >  }
> > > > > 
> > > > > Why not just gate it on needing to reschedule, rather than some random
> > > > > value?
> > > > > 
> > > > > static int blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
> > > > > {
> > > > > 	int ret;
> > > > > 
> > > > > 	do {
> > > > > 		ret = __blk_mq_do_dispatch_sched(hctx);
> > > > > 	} while (ret == 1 && !need_resched());
> > > > > 
> > > > > 	if (ret == 1 && need_resched())
> > > > > 		blk_mq_delay_run_hw_queue(hctx, 0);
> > > > > 
> > > > > 	return ret;
> > > > > }
> > > > > 
> > > > > or something like that.
> > > > 
> > > > Jens, thanks for the idea, but need_resched() check does not look working here.
> > > > I tried the code above but still the BUG message is observed. My guess is that
> > > > in the call stack below, every __blk_mq_do_dispatch_sched() call results in
> > > > might_sleep_if() call, then need_resched() does not work as expected, probably.
> > > > 
> > > > __blk_mq_do_dispatch_sched
> > > >   blk_mq_dispatch_rq_list
> > > >     q->mq_ops->queue_rq
> > > >       null_queue_rq
> > > >         might_sleep_if
> > > > 
> > > > Now I'm trying to find similar way as need_resched() to avoid the random number.
> > > > So far I haven't found good idea yet.
> > > 
> > > Jens patch using need_resched() looks improving the situation, also the
> > > scsi_debug case won't set BLOCKING:
> > > 
> > > 1) without the patch, it can be easy to trigger lockup with the
> > > following test.
> > > 
> > > - modprobe scsi_debug virtual_gb=128 delay=0 dev_size_mb=2048
> > > - fio --bs=512k --ioengine=sync --iodepth=128 --numjobs=4 --rw=randrw \
> > > 	--name=sdc-sync-randrw-512k --filename=/dev/sdc --direct=1 --size=60G --runtime=120
> > > 
> > > #sdc is the created scsi_debug disk
> > 
> > Thanks. I tried the work load above and observed the lockup BUG message on my
> > system. So, I reconfirmed that the problem happens with both BLOCKING and
> > non-BLOCKING drivers.
> > 
> > Regarding the solution, I can not think of any good one. I tried to remove the
> > WQ_HIGHPRI flag from kblockd_workqueue, but it did not look affecting
> > need_resched() behavior. I walked through workqueue API, but was not able
> > to find anything useful.
> > 
> > As far as I understand, it is assumed and expected the each work item gets
> > completed within decent time. Then this blk_mq_run_work_fn must stop within
> > decent time by breaking the loop at some point. As the loop break conditions
> > other than need_resched(), I can think of 1) loop count, 2) number of requests
> > dispatched or 3) time spent in the loop. All of the three require a magic random
> > number as the limit... Is there any other better way?
> > 
> > If we need to choose one of the 3 options, I think '3) time spent in the loop'
> > is better than others, since workqueue watchdog monitors _time_ to check lockup
> > and report the BUG message.
> 
> BTW, just tried 3), then the lockup issue can't be reproduced any more:
> 
> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> index e6ad8f761474..b4de5a7ec606 100644
> --- a/block/blk-mq-sched.c
> +++ b/block/blk-mq-sched.c
> @@ -181,10 +181,14 @@ static int __blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
>  static int blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
>  {
>         int ret;
> +       unsigned long start = jiffies;
>  
>         do {
>                 ret = __blk_mq_do_dispatch_sched(hctx);
> -       } while (ret == 1);
> +       } while (ret == 1 && !need_resched() && (jiffies - start) < HZ);
> +
> +       if (ret == 1 && (need_resched() || jiffies - start >= HZ))
> +                blk_mq_delay_run_hw_queue(hctx, 0);
>  
>         return ret;
>  }

It sounds like a good idea to check both need_resched() and 3) the time spent in the
loop. I also confirmed that this fix avoids the BUG message with the scsi_debug
workload as well as with null_blk with memory backing. Looks good. For this
confirmation, I modified the hunk above to avoid the duplicated checks [1].

As for the loop break limit, I think HZ = 1 second is appropriate. The workqueue
watchdog checks for lockups using the duration 'wq_watchdog_thresh' defined in
kernel/workqueue.c. In the worst case, its value is 1, meaning 1 second. Then a
1-second loop break in blk_mq_do_dispatch_sched() should avoid the BUG message.

To reduce the influence on performance, it would be good to make this number
larger. One idea was to use wq_watchdog_thresh as the limit for the loop
break. However, the variable is static and defined only when CONFIG_WQ_WATCHDOG
is enabled, so I don't think the block layer can refer to it.

Assuming this fix approach is OK, I would like to have a formal patch. Ming,
would you mind creating it? Or, if you prefer, I'm willing to do it.

[1]

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 55488ba978232..64941615befc6 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -181,9 +181,15 @@ static int __blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
 static int blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
 {
 	int ret;
+	unsigned long end = jiffies + HZ;
 
 	do {
 		ret = __blk_mq_do_dispatch_sched(hctx);
+		if (ret == 1 &&
+		    (need_resched() || time_is_before_jiffies(end))) {
+			blk_mq_delay_run_hw_queue(hctx, 0);
+			break;
+		}
 	} while (ret == 1);
 
 	return ret;
Ming Lei March 15, 2022, 6:10 a.m. UTC | #9
On Tue, Mar 15, 2022 at 05:24:32AM +0000, Shinichiro Kawasaki wrote:
> On Mar 14, 2022 / 15:00, Ming Lei wrote:
> > On Mon, Mar 14, 2022 at 05:24:34AM +0000, Shinichiro Kawasaki wrote:
> > > On Mar 11, 2022 / 17:51, Ming Lei wrote:
> > > > On Fri, Mar 11, 2022 at 06:24:41AM +0000, Shinichiro Kawasaki wrote:
> > > > > On Mar 10, 2022 / 05:47, Jens Axboe wrote:
> > > > > > On 3/10/22 5:40 AM, Shinichiro Kawasaki wrote:
> > > > > > > On Mar 10, 2022 / 18:00, Ming Lei wrote:
> > > > > > >> On Thu, Mar 10, 2022 at 09:16:50AM +0000, Shinichiro Kawasaki wrote:
> > > > > > >>> This issue does not look critical, but let me share it to ask comments for fix.
> > > > > > >>>
> > > > > > >>> When fio command with 40 jobs [1] is run for a null_blk device with memory
> > > > > > >>> backing and mq-deadline scheduler, kernel reports a BUG message [2]. The
> > > > > > >>> workqueue watchdog reports that kblockd blk_mq_run_work_fn keeps on running
> > > > > > >>> more than 30 seconds and other work can not run. The 40 fio jobs keep on
> > > > > > >>> creating many read requests to a single null_blk device, then the every time
> > > > > > >>> the mq_run task calls __blk_mq_do_dispatch_sched(), it returns ret == 1 which
> > > > > > >>> means more than one request was dispatched. Hence, the while loop in
> > > > > > >>> blk_mq_do_dispatch_sched() does not break.
> > > > > > >>>
> > > > > > >>> static int blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
> > > > > > >>> {
> > > > > > >>>         int ret;
> > > > > > >>>
> > > > > > >>>         do {
> > > > > > >>>                ret = __blk_mq_do_dispatch_sched(hctx);
> > > > > > >>>         } while (ret == 1);
> > > > > > >>>
> > > > > > >>>         return ret;
> > > > > > >>> }
> > > > > > >>>
> > > > > > >>> The BUG message was observed when I ran blktests block/005 with various
> > > > > > >>> conditions on a system with 40 CPUs. It was observed with kernel version
> > > > > > >>> v5.16-rc1 through v5.17-rc7. The trigger commit was 0a593fbbc245 ("null_blk:
> > > > > > >>> poll queue support"). This commit added blk_mq_ops.map_queues callback. I
> > > > > > >>> guess it changed dispatch behavior for null_blk devices and triggered the
> > > > > > >>> BUG message.
> > > > > > >>
> > > > > > >> It is one blk-mq soft lockup issue in dispatch side, and shouldn't be related
> > > > > > >> with 0a593fbbc245.
> > > > > > >>
> > > > > > >> If queueing requests is faster than dispatching, the issue will be triggered
> > > > > > >> sooner or later, especially easy to trigger in SQ device. I am sure it can
> > > > > > >> be triggered on scsi debug, even saw such report on ahci.
> > > > > > > 
> > > > > > > Thank you for the comments. Then this is the real problem.
> > > > > > > 
> > > > > > >>
> > > > > > >>>
> > > > > > >>> I'm not so sure if we really need to fix this issue. It does not seem the real
> > > > > > >>> world problem since it is observed only with null_blk. The real block devices
> > > > > > >>> have slower IO operation then the dispatch should stop sooner when the hardware
> > > > > > >>> queue gets full. Also the 40 jobs for single device is not realistic workload.
> > > > > > >>>
> > > > > > >>> Having said that, it does not feel right that other works are pended during
> > > > > > >>> dispatch for null_blk devices. To avoid the BUG message, I can think of two
> > > > > > >>> fix approaches. First one is to break the while loop in blk_mq_do_dispatch_sched
> > > > > > >>> using a loop counter [3] (or jiffies timeout check).
> > > > > > >>
> > > > > > >> This way could work, but the queue need to be re-run after breaking
> > > > > > >> caused by max dispatch number. cond_resched() might be the simplest way,
> > > > > > >> but it can't be used here because of rcu/srcu read lock.
> > > > > > > 
> > > > > > > As far as I understand, blk_mq_run_work_fn() should return after the loop break
> > > > > > > to yield the worker to other works. How about to call
> > > > > > > blk_mq_delay_run_hw_queue() at the loop break? Does this re-run the dispatch?
> > > > > > > 
> > > > > > > 
> > > > > > > diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> > > > > > > index 55488ba978232..faa29448a72a0 100644
> > > > > > > --- a/block/blk-mq-sched.c
> > > > > > > +++ b/block/blk-mq-sched.c
> > > > > > > @@ -178,13 +178,19 @@ static int __blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
> > > > > > >  	return !!dispatched;
> > > > > > >  }
> > > > > > >  
> > > > > > > +#define MQ_DISPATCH_MAX 0x10000
> > > > > > > +
> > > > > > >  static int blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
> > > > > > >  {
> > > > > > >  	int ret;
> > > > > > > +	unsigned int count = MQ_DISPATCH_MAX;
> > > > > > >  
> > > > > > >  	do {
> > > > > > >  		ret = __blk_mq_do_dispatch_sched(hctx);
> > > > > > > -	} while (ret == 1);
> > > > > > > +	} while (ret == 1 && count--);
> > > > > > > +
> > > > > > > +	if (ret == 1 && !count)
> > > > > > > +		blk_mq_delay_run_hw_queue(hctx, 0);
> > > > > > >  
> > > > > > >  	return ret;
> > > > > > >  }
> > > > > > 
> > > > > > Why not just gate it on needing to reschedule, rather than some random
> > > > > > value?
> > > > > > 
> > > > > > static int blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
> > > > > > {
> > > > > > 	int ret;
> > > > > > 
> > > > > > 	do {
> > > > > > 		ret = __blk_mq_do_dispatch_sched(hctx);
> > > > > > 	} while (ret == 1 && !need_resched());
> > > > > > 
> > > > > > 	if (ret == 1 && need_resched())
> > > > > > 		blk_mq_delay_run_hw_queue(hctx, 0);
> > > > > > 
> > > > > > 	return ret;
> > > > > > }
> > > > > > 
> > > > > > or something like that.
> > > > > 
> > > > > Jens, thanks for the idea, but need_resched() check does not look working here.
> > > > > I tried the code above but still the BUG message is observed. My guess is that
> > > > > in the call stack below, every __blk_mq_do_dispatch_sched() call results in
> > > > > might_sleep_if() call, then need_resched() does not work as expected, probably.
> > > > > 
> > > > > __blk_mq_do_dispatch_sched
> > > > >   blk_mq_dispatch_rq_list
> > > > >     q->mq_ops->queue_rq
> > > > >       null_queue_rq
> > > > >         might_sleep_if
> > > > > 
> > > > > Now I'm trying to find a way similar to need_resched() that avoids the random
> > > > > number, but so far I haven't found a good idea.
> > > > 
> > > > Jens' patch using need_resched() looks like it improves the situation; also,
> > > > the scsi_debug case won't set BLOCKING:
> > > > 
> > > > 1) without the patch, it is easy to trigger the lockup with the
> > > > following test.
> > > > 
> > > > - modprobe scsi_debug virtual_gb=128 delay=0 dev_size_mb=2048
> > > > - fio --bs=512k --ioengine=sync --iodepth=128 --numjobs=4 --rw=randrw \
> > > > 	--name=sdc-sync-randrw-512k --filename=/dev/sdc --direct=1 --size=60G --runtime=120
> > > > 
> > > > #sdc is the created scsi_debug disk
> > > 
> > > Thanks. I tried the workload above and observed the lockup BUG message on my
> > > system. So I reconfirmed that the problem happens with both BLOCKING and
> > > non-BLOCKING drivers.
> > > 
> > > Regarding the solution, I cannot think of a good one. I tried removing the
> > > WQ_HIGHPRI flag from kblockd_workqueue, but it did not appear to affect
> > > need_resched() behavior. I walked through the workqueue API but was not able
> > > to find anything useful.
> > > 
> > > As far as I understand, it is assumed and expected that each work item
> > > completes within a decent amount of time, so blk_mq_run_work_fn must stop
> > > within a decent time by breaking the loop at some point. As loop break
> > > conditions other than need_resched(), I can think of 1) loop count, 2) number
> > > of requests dispatched, or 3) time spent in the loop. All three require a
> > > magic number as the limit... Is there any better way?
> > > 
> > > If we need to choose one of the three options, I think '3) time spent in the
> > > loop' is better than the others, since the workqueue watchdog monitors _time_
> > > to detect the lockup and report the BUG message.
> > 
> > BTW, I just tried 3), and the lockup issue can't be reproduced any more:
> > 
> > diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> > index e6ad8f761474..b4de5a7ec606 100644
> > --- a/block/blk-mq-sched.c
> > +++ b/block/blk-mq-sched.c
> > @@ -181,10 +181,14 @@ static int __blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
> >  static int blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
> >  {
> >         int ret;
> > +       unsigned long start = jiffies;
> >  
> >         do {
> >                 ret = __blk_mq_do_dispatch_sched(hctx);
> > -       } while (ret == 1);
> > +       } while (ret == 1 && !need_resched() && (jiffies - start) < HZ);
> > +
> > +       if (ret == 1 && (need_resched() || jiffies - start >= HZ))
> > +                blk_mq_delay_run_hw_queue(hctx, 0);
> >  
> >         return ret;
> >  }
> 
> It sounds like a good idea to check both need_resched() and 3) time spent in the
> loop. I also confirmed that this fix avoids the BUG message for the scsi_debug
> workload as well as for null_blk with memory backing. Looks good. For this
> confirmation, I modified the hunk above to avoid duplicate checks [1].
> 
> As for the loop break limit, I think HZ = 1 second is appropriate. The workqueue
> watchdog checks for lockup using the duration 'wq_watchdog_thresh' defined in
> kernel/workqueue.c. In the worst case, its value is 1, meaning 1 second, so a
> one-second loop break in blk_mq_do_dispatch_sched() should avoid the BUG message.
> 
> To reduce the influence on performance, it would be good to make this number
> larger. One idea was to use wq_watchdog_thresh as the limit for the loop break.
> However, that variable is static and defined only when CONFIG_WQ_WATCHDOG is
> enabled, so I don't think the block layer can refer to it.
> 
> Assuming this fix approach is OK, I would like to have a formal patch. Ming,
> would you mind creating it? Or, if you prefer, I'm willing to do it.
> 
> [1]
> 
> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> index 55488ba978232..64941615befc6 100644
> --- a/block/blk-mq-sched.c
> +++ b/block/blk-mq-sched.c
> @@ -181,9 +181,15 @@ static int __blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
>  static int blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
>  {
>  	int ret;
> +	unsigned long end = jiffies + HZ;
>  
>  	do {
>  		ret = __blk_mq_do_dispatch_sched(hctx);
> +		if (ret == 1 &&
> +		    (need_resched() || time_is_before_jiffies(end))) {
> +			blk_mq_delay_run_hw_queue(hctx, 0);
> +			break;
> +		}
>  	} while (ret == 1);

I am fine with this patch, so please prepare a formal patch and see
if Jens and the others are fine with it.

Reviewed-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Ming Lei <ming.lei@redhat.com>


Thanks,
Ming
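
One plausible reading of why the need_resched()-only break discussed above was not
enough for the BLOCKING null_blk case (an assumption consistent with the call stack
quoted above, not something verified in this thread): under voluntary preemption,
might_sleep() itself contains a cond_resched() point, so a pending reschedule is
consumed inside ->queue_rq() before the dispatch loop ever tests need_resched().
Roughly:

null_queue_rq()
  might_sleep_if(hctx->flags & BLK_MQ_F_BLOCKING)
    might_sleep()
      might_resched()   /* acts as cond_resched() under CONFIG_PREEMPT_VOLUNTARY */
        __schedule()    /* reschedules and clears TIF_NEED_RESCHED */
/* back in blk_mq_do_dispatch_sched(), need_resched() now reads false */
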
Shinichiro Kawasaki March 15, 2022, 11:10 a.m. UTC | #10
On Mar 15, 2022 / 14:10, Ming Lei wrote:
> On Tue, Mar 15, 2022 at 05:24:32AM +0000, Shinichiro Kawasaki wrote:
> > On Mar 14, 2022 / 15:00, Ming Lei wrote:
> > > On Mon, Mar 14, 2022 at 05:24:34AM +0000, Shinichiro Kawasaki wrote:
> > > > On Mar 11, 2022 / 17:51, Ming Lei wrote:
> > > > > On Fri, Mar 11, 2022 at 06:24:41AM +0000, Shinichiro Kawasaki wrote:
> > > > > > On Mar 10, 2022 / 05:47, Jens Axboe wrote:
> > > > > > > On 3/10/22 5:40 AM, Shinichiro Kawasaki wrote:
> > > > > > > > On Mar 10, 2022 / 18:00, Ming Lei wrote:
> > > > > > > >> On Thu, Mar 10, 2022 at 09:16:50AM +0000, Shinichiro Kawasaki wrote:
> > > > > > > >>> This issue does not look critical, but let me share it to ask comments for fix.
> > > > > > > >>>
> > > > > > > >>> When fio command with 40 jobs [1] is run for a null_blk device with memory
> > > > > > > >>> backing and mq-deadline scheduler, kernel reports a BUG message [2]. The
> > > > > > > >>> workqueue watchdog reports that kblockd blk_mq_run_work_fn keeps on running
> > > > > > > >>> more than 30 seconds and other work can not run. The 40 fio jobs keep on
> > > > > > > >>> creating many read requests to a single null_blk device, then the every time
> > > > > > > >>> the mq_run task calls __blk_mq_do_dispatch_sched(), it returns ret == 1 which
> > > > > > > >>> means more than one request was dispatched. Hence, the while loop in
> > > > > > > >>> blk_mq_do_dispatch_sched() does not break.
> > > > > > > >>>
> > > > > > > >>> static int blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
> > > > > > > >>> {
> > > > > > > >>>         int ret;
> > > > > > > >>>
> > > > > > > >>>         do {
> > > > > > > >>>                ret = __blk_mq_do_dispatch_sched(hctx);
> > > > > > > >>>         } while (ret == 1);
> > > > > > > >>>
> > > > > > > >>>         return ret;
> > > > > > > >>> }
> > > > > > > >>>
> > > > > > > >>> The BUG message was observed when I ran blktests block/005 with various
> > > > > > > >>> conditions on a system with 40 CPUs. It was observed with kernel version
> > > > > > > >>> v5.16-rc1 through v5.17-rc7. The trigger commit was 0a593fbbc245 ("null_blk:
> > > > > > > >>> poll queue support"). This commit added blk_mq_ops.map_queues callback. I
> > > > > > > >>> guess it changed dispatch behavior for null_blk devices and triggered the
> > > > > > > >>> BUG message.
> > > > > > > >>
> > > > > > > >> It is one blk-mq soft lockup issue in dispatch side, and shouldn't be related
> > > > > > > >> with 0a593fbbc245.
> > > > > > > >>
> > > > > > > >> If queueing requests is faster than dispatching, the issue will be triggered
> > > > > > > >> sooner or later, and it is especially easy to trigger on an SQ device. I am
> > > > > > > >> sure it can be triggered on scsi_debug; I even saw such a report on ahci.
> > > > > > > > 
> > > > > > > > Thank you for the comments. Then this is a real problem.
> > > > > > > > 
> > > > > > > >>
> > > > > > > >>>
> > > > > > > >>> I'm not so sure if we really need to fix this issue. It does not seem the real
> > > > > > > >>> world problem since it is observed only with null_blk. The real block devices
> > > > > > > >>> have slower IO operation then the dispatch should stop sooner when the hardware
> > > > > > > >>> queue gets full. Also the 40 jobs for single device is not realistic workload.
> > > > > > > >>>
> > > > > > > >>> Having said that, it does not feel right that other works are pended during
> > > > > > > >>> dispatch for null_blk devices. To avoid the BUG message, I can think of two
> > > > > > > >>> fix approaches. First one is to break the while loop in blk_mq_do_dispatch_sched
> > > > > > > >>> using a loop counter [3] (or jiffies timeout check).
> > > > > > > >>
> > > > > > > >> This way could work, but the queue needs to be re-run after breaking out
> > > > > > > >> on the max dispatch count. cond_resched() might be the simplest way, but
> > > > > > > >> it can't be used here because of the rcu/srcu read lock.
> > > > > > > > 
> > > > > > > > As far as I understand, blk_mq_run_work_fn() should return after the loop break
> > > > > > > > to yield the worker to other work items. How about calling
> > > > > > > > blk_mq_delay_run_hw_queue() at the loop break? Does this re-run the dispatch?
> > > > > > > > 
> > > > > > > > 
> > > > > > > > diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> > > > > > > > index 55488ba978232..faa29448a72a0 100644
> > > > > > > > --- a/block/blk-mq-sched.c
> > > > > > > > +++ b/block/blk-mq-sched.c
> > > > > > > > @@ -178,13 +178,19 @@ static int __blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
> > > > > > > >  	return !!dispatched;
> > > > > > > >  }
> > > > > > > >  
> > > > > > > > +#define MQ_DISPATCH_MAX 0x10000
> > > > > > > > +
> > > > > > > >  static int blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
> > > > > > > >  {
> > > > > > > >  	int ret;
> > > > > > > > +	unsigned int count = MQ_DISPATCH_MAX;
> > > > > > > >  
> > > > > > > >  	do {
> > > > > > > >  		ret = __blk_mq_do_dispatch_sched(hctx);
> > > > > > > > -	} while (ret == 1);
> > > > > > > > +	} while (ret == 1 && count--);
> > > > > > > > +
> > > > > > > > +	if (ret == 1 && !count)
> > > > > > > > +		blk_mq_delay_run_hw_queue(hctx, 0);
> > > > > > > >  
> > > > > > > >  	return ret;
> > > > > > > >  }
> > > > > > > 
> > > > > > > Why not just gate it on needing to reschedule, rather than some random
> > > > > > > value?
> > > > > > > 
> > > > > > > static int blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
> > > > > > > {
> > > > > > > 	int ret;
> > > > > > > 
> > > > > > > 	do {
> > > > > > > 		ret = __blk_mq_do_dispatch_sched(hctx);
> > > > > > > 	} while (ret == 1 && !need_resched());
> > > > > > > 
> > > > > > > 	if (ret == 1 && need_resched())
> > > > > > > 		blk_mq_delay_run_hw_queue(hctx, 0);
> > > > > > > 
> > > > > > > 	return ret;
> > > > > > > }
> > > > > > > 
> > > > > > > or something like that.
> > > > > > 
> > > > > > Jens, thanks for the idea, but the need_resched() check does not seem to work
> > > > > > here. I tried the code above but the BUG message is still observed. My guess is
> > > > > > that in the call stack below, every __blk_mq_do_dispatch_sched() call results in
> > > > > > a might_sleep_if() call, so need_resched() probably does not work as expected.
> > > > > > 
> > > > > > __blk_mq_do_dispatch_sched
> > > > > >   blk_mq_dispatch_rq_list
> > > > > >     q->mq_ops->queue_rq
> > > > > >       null_queue_rq
> > > > > >         might_sleep_if
> > > > > > 
> > > > > > Now I'm trying to find a way similar to need_resched() that avoids the random
> > > > > > number, but so far I haven't found a good idea.
> > > > > 
> > > > > Jens' patch using need_resched() looks like it improves the situation; also,
> > > > > the scsi_debug case won't set BLOCKING:
> > > > > 
> > > > > 1) without the patch, it is easy to trigger the lockup with the
> > > > > following test.
> > > > > 
> > > > > - modprobe scsi_debug virtual_gb=128 delay=0 dev_size_mb=2048
> > > > > - fio --bs=512k --ioengine=sync --iodepth=128 --numjobs=4 --rw=randrw \
> > > > > 	--name=sdc-sync-randrw-512k --filename=/dev/sdc --direct=1 --size=60G --runtime=120
> > > > > 
> > > > > #sdc is the created scsi_debug disk
> > > > 
> > > > Thanks. I tried the workload above and observed the lockup BUG message on my
> > > > system. So I reconfirmed that the problem happens with both BLOCKING and
> > > > non-BLOCKING drivers.
> > > > 
> > > > Regarding the solution, I cannot think of a good one. I tried removing the
> > > > WQ_HIGHPRI flag from kblockd_workqueue, but it did not appear to affect
> > > > need_resched() behavior. I walked through the workqueue API but was not able
> > > > to find anything useful.
> > > > 
> > > > As far as I understand, it is assumed and expected that each work item
> > > > completes within a decent amount of time, so blk_mq_run_work_fn must stop
> > > > within a decent time by breaking the loop at some point. As loop break
> > > > conditions other than need_resched(), I can think of 1) loop count, 2) number
> > > > of requests dispatched, or 3) time spent in the loop. All three require a
> > > > magic number as the limit... Is there any better way?
> > > > 
> > > > If we need to choose one of the three options, I think '3) time spent in the
> > > > loop' is better than the others, since the workqueue watchdog monitors _time_
> > > > to detect the lockup and report the BUG message.
> > > 
> > > BTW, I just tried 3), and the lockup issue can't be reproduced any more:
> > > 
> > > diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> > > index e6ad8f761474..b4de5a7ec606 100644
> > > --- a/block/blk-mq-sched.c
> > > +++ b/block/blk-mq-sched.c
> > > @@ -181,10 +181,14 @@ static int __blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
> > >  static int blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
> > >  {
> > >         int ret;
> > > +       unsigned long start = jiffies;
> > >  
> > >         do {
> > >                 ret = __blk_mq_do_dispatch_sched(hctx);
> > > -       } while (ret == 1);
> > > +       } while (ret == 1 && !need_resched() && (jiffies - start) < HZ);
> > > +
> > > +       if (ret == 1 && (need_resched() || jiffies - start >= HZ))
> > > +                blk_mq_delay_run_hw_queue(hctx, 0);
> > >  
> > >         return ret;
> > >  }
> > 
> > It sounds like a good idea to check both need_resched() and 3) time spent in the
> > loop. I also confirmed that this fix avoids the BUG message for the scsi_debug
> > workload as well as for null_blk with memory backing. Looks good. For this
> > confirmation, I modified the hunk above to avoid duplicate checks [1].
> > 
> > As for the loop break limit, I think HZ = 1 second is appropriate. The workqueue
> > watchdog checks for lockup using the duration 'wq_watchdog_thresh' defined in
> > kernel/workqueue.c. In the worst case, its value is 1, meaning 1 second, so a
> > one-second loop break in blk_mq_do_dispatch_sched() should avoid the BUG message.
> > 
> > To reduce the influence on performance, it would be good to make this number
> > larger. One idea was to use wq_watchdog_thresh as the limit for the loop break.
> > However, that variable is static and defined only when CONFIG_WQ_WATCHDOG is
> > enabled, so I don't think the block layer can refer to it.
> > 
> > Assuming this fix approach is OK, I would like to have a formal patch. Ming,
> > would you mind creating it? Or, if you prefer, I'm willing to do it.
> > 
> > [1]
> > 
> > diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> > index 55488ba978232..64941615befc6 100644
> > --- a/block/blk-mq-sched.c
> > +++ b/block/blk-mq-sched.c
> > @@ -181,9 +181,15 @@ static int __blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
> >  static int blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
> >  {
> >  	int ret;
> > +	unsigned long end = jiffies + HZ;
> >  
> >  	do {
> >  		ret = __blk_mq_do_dispatch_sched(hctx);
> > +		if (ret == 1 &&
> > +		    (need_resched() || time_is_before_jiffies(end))) {
> > +			blk_mq_delay_run_hw_queue(hctx, 0);
> > +			break;
> > +		}
> >  	} while (ret == 1);
> 
> I am fine with this patch, so please prepare a formal patch and see
> if Jens and the others are fine with it.
> 
> Reviewed-by: Ming Lei <ming.lei@redhat.com>
> Tested-by: Ming Lei <ming.lei@redhat.com>

Okay, I will prepare the patch. I'm now running a regression test run for
confirmation and plan to post the patch tomorrow.

I also confirmed that the BUG message is triggered on the stable kernel v5.10.96
using scsi_debug. The patch above can be applied to this stable kernel and
avoids the BUG. I will add a Fixes tag for commit 6e6fcbc27e778 ("blk-mq:
support batching dispatch in case of io") so that the patch gets propagated
to stable kernels.
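
For reference, here is a minimal sketch of how blk_mq_do_dispatch_sched() reads with
the hunk from [1] applied, restructured only for readability (this is a reading aid
for the discussion above, not the formal patch itself):

static int blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
{
	unsigned long end = jiffies + HZ;	/* dispatch time budget: one second */
	int ret;

	do {
		ret = __blk_mq_do_dispatch_sched(hctx);
		if (ret != 1)
			break;
		/*
		 * A reschedule is pending or the time budget is spent: yield
		 * the kblockd worker and queue another run of the hw queue so
		 * the remaining requests are dispatched later, not dropped.
		 */
		if (need_resched() || time_is_before_jiffies(end)) {
			blk_mq_delay_run_hw_queue(hctx, 0);
			break;
		}
	} while (1);

	return ret;
}

Since blk_mq_delay_run_hw_queue(hctx, 0) re-queues the hardware queue run, breaking
out early only defers the remaining dispatch work to the next kblockd invocation.
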
diff mbox series

Patch

[3]

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 55488ba978232..746c75cb6aafb 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -178,13 +178,16 @@  static int __blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
 	return !!dispatched;
 }
 
+#define DISPATCH_BREAK_COUNT 0x10000
+
 static int blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
 {
 	int ret;
+	unsigned int count = DISPATCH_BREAK_COUNT;
 
 	do {
 		ret = __blk_mq_do_dispatch_sched(hctx);
-	} while (ret == 1);
+	} while (ret == 1 && count--);
 
 	return ret;
 }


[4]

diff --git a/drivers/block/null_blk/main.c b/drivers/block/null_blk/main.c
index 13004beb48cab..3ef3a18fa8f3f 100644
--- a/drivers/block/null_blk/main.c
+++ b/drivers/block/null_blk/main.c
@@ -22,6 +22,8 @@  static DECLARE_FAULT_ATTR(null_requeue_attr);
 static DECLARE_FAULT_ATTR(null_init_hctx_attr);
 #endif
 
+#define REQUEST_BREAK_COUNT 0x10000
+
 static inline u64 mb_per_tick(int mbps)
 {
 	return (1 << 20) / TICKS_PER_SEC * ((u64) mbps);
@@ -1493,13 +1495,13 @@  static bool should_timeout_request(struct request *rq)
 	return false;
 }
 
-static bool should_requeue_request(struct request *rq)
+static bool should_requeue_request(struct request *rq, struct nullb_queue *nq)
 {
 #ifdef CONFIG_BLK_DEV_NULL_BLK_FAULT_INJECTION
 	if (g_requeue_str[0])
 		return should_fail(&null_requeue_attr, 1);
 #endif
-	return false;
+	return !(nq->request_count % REQUEST_BREAK_COUNT);
 }
 
 static int null_map_queues(struct blk_mq_tag_set *set)
@@ -1632,8 +1634,9 @@  static blk_status_t null_queue_rq(struct blk_mq_hw_ctx *hctx,
 	cmd->fake_timeout = should_timeout_request(bd->rq);
 
 	blk_mq_start_request(bd->rq);
+	nq->request_count++;
 
-	if (should_requeue_request(bd->rq)) {
+	if (should_requeue_request(bd->rq, nq)) {
 		/*
 		 * Alternate between hitting the core BUSY path, and the
 		 * driver driven requeue path
@@ -1690,6 +1693,7 @@  static void null_init_queue(struct nullb *nullb, struct nullb_queue *nq)
 	nq->dev = nullb->dev;
 	INIT_LIST_HEAD(&nq->poll_list);
 	spin_lock_init(&nq->poll_lock);
+	nq->request_count = 0;
 }
 
 static int null_init_hctx(struct blk_mq_hw_ctx *hctx, void *driver_data,
diff --git a/drivers/block/null_blk/null_blk.h b/drivers/block/null_blk/null_blk.h
index 78eb56b0ca55f..165669ce97ad3 100644
--- a/drivers/block/null_blk/null_blk.h
+++ b/drivers/block/null_blk/null_blk.h
@@ -36,6 +36,7 @@  struct nullb_queue {
 	spinlock_t poll_lock;
 
 	struct nullb_cmd *cmds;
+	unsigned int request_count;
 };
 
 struct nullb_zone {