
[1/6] blk-mq: Make blk_mq_quiesce_queue() wait for all .queue_rq() calls

Message ID 20170427155437.23228-2-bart.vanassche@sandisk.com (mailing list archive)
State New, archived

Commit Message

Bart Van Assche April 27, 2017, 3:54 p.m. UTC
blk_mq_quiesce_queue() callers, e.g. elevator_switch_mq(), assume
that no .queue_rq() calls occur while switching to another I/O
scheduler. This patch fixes the following kernel crash that occurs if
an I/O scheduler other than "none" is the default scheduler:

general protection fault: 0000 [#1] SMP
RIP: 0010:__lock_acquire+0xfe/0x1280
Call Trace:
 lock_acquire+0xd5/0x1c0
 _raw_spin_lock+0x2a/0x40
 dd_dispatch_request+0x29/0x1e0
 blk_mq_sched_dispatch_requests+0x139/0x190
 __blk_mq_run_hw_queue+0x12d/0x1c0
 blk_mq_run_work_fn+0xd/0x10
 process_one_work+0x206/0x6a0
 worker_thread+0x49/0x4a0
 kthread+0x107/0x140
 ret_from_fork+0x2e/0x40

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: <stable@vger.kernel.org>
---
 block/blk-mq.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)
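
For context: the dd_dispatch_request() frame in the trace above takes the
mq-deadline spinlock almost immediately, so running the dispatch work while
the elevator data is being torn down by a scheduler switch dereferences freed
memory. A rough, simplified sketch of that entry point (4.11-era mq-deadline;
illustrative only, not part of this patch):

	static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
	{
		/* freed/replaced while the I/O scheduler is switched */
		struct deadline_data *dd = hctx->queue->elevator->elevator_data;
		struct request *rq = NULL;

		spin_lock(&dd->lock);	/* general protection fault if dd is stale */
		/* ... pick the next request from the deadline FIFO/sort lists ... */
		spin_unlock(&dd->lock);
		return rq;
	}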

Comments

Ming Lei April 28, 2017, 2 a.m. UTC | #1
Hi Bart,

On Thu, Apr 27, 2017 at 08:54:32AM -0700, Bart Van Assche wrote:
> blk_mq_quiesce_queue() callers, e.g. elevator_switch_mq(), assume
> that no .queue_rq() calls occur while switching to another I/O

I think we should call blk_mq_freeze_queue_wait() instead
of blk_mq_quiesce_queue() in elevator_switch_mq(), otherwise
it is easy to cause a use-after-free.

> scheduler. This patch fixes the following kernel crash that occurs if
> an I/O scheduler other than "none" is the default scheduler:
> 
> general protection fault: 0000 [#1] SMP
> RIP: 0010:__lock_acquire+0xfe/0x1280
> Call Trace:
>  lock_acquire+0xd5/0x1c0
>  _raw_spin_lock+0x2a/0x40
>  dd_dispatch_request+0x29/0x1e0
>  blk_mq_sched_dispatch_requests+0x139/0x190
>  __blk_mq_run_hw_queue+0x12d/0x1c0
>  blk_mq_run_work_fn+0xd/0x10
>  process_one_work+0x206/0x6a0
>  worker_thread+0x49/0x4a0
>  kthread+0x107/0x140
>  ret_from_fork+0x2e/0x40
> 
> Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
> Cc: Hannes Reinecke <hare@suse.com>
> Cc: Omar Sandoval <osandov@fb.com>
> Cc: Ming Lei <ming.lei@redhat.com>
> Cc: <stable@vger.kernel.org>
> ---
>  block/blk-mq.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index b75ef2392db7..3b3420f76b5a 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1224,8 +1224,9 @@ EXPORT_SYMBOL(blk_mq_queue_stopped);
>  
>  void blk_mq_stop_hw_queue(struct blk_mq_hw_ctx *hctx)
>  {
> -	cancel_work(&hctx->run_work);
> -	cancel_delayed_work(&hctx->delay_work);
> +	cancel_work_sync(&hctx->run_work);
> +	cancel_delayed_work_sync(&hctx->delay_work);

Could you explain a bit why we need the sync version?

> +	cancel_delayed_work_sync(&hctx->delayed_run_work);

The more we introduce, the more bugs may come, :-)

So I suggest unifying both .run_work and .delayed_run_work
into one work item, just as Jens did in the following link:

	http://marc.info/?t=149183989800010&r=1&w=2


Thanks,
Ming
Ming Lei April 28, 2017, 3:48 a.m. UTC | #2
Hi,

On Fri, Apr 28, 2017 at 10:00:44AM +0800, Ming Lei wrote:
> Hi Bart,
> 
> On Thu, Apr 27, 2017 at 08:54:32AM -0700, Bart Van Assche wrote:
> > blk_mq_quiesce_queue() callers, e.g. elevator_switch_mq(), assume
> > that no .queue_rq() calls occur while switching to another I/O
> 
> I think we should call blk_mq_freeze_queue_wait() instead
> of blk_mq_quiesce_queue() in elevator_switch_mq(), otherwise
> it is easy to cause a use-after-free.

Sorry, my fault: blk_mq_freeze_queue_wait() is already done in
blk_mq_freeze_queue(), so no new I/O can enter the queue
any more, and all old pending I/Os will have finished when
blk_mq_freeze_queue() returns.
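
For reference, a minimal sketch of those freeze semantics (simplified; the
real code also manages q->mq_freeze_depth and the q_usage_counter percpu
refcount):

	/* roughly what blk_mq_freeze_queue() boils down to */
	void freeze_queue_sketch(struct request_queue *q)
	{
		blk_freeze_queue_start(q);	/* new submitters now block or fail */
		blk_mq_freeze_queue_wait(q);	/* wait for q_usage_counter to drop to
						 * zero, i.e. all pending requests done */
	}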

> 
> > scheduler. This patch fixes the following kernel crash if another
> > I/O scheduler than "none" is the default scheduler:
> > 
> > general protection fault: 0000 [#1] SMP
> > RIP: 0010:__lock_acquire+0xfe/0x1280
> > Call Trace:
> >  lock_acquire+0xd5/0x1c0
> >  _raw_spin_lock+0x2a/0x40
> >  dd_dispatch_request+0x29/0x1e0
> >  blk_mq_sched_dispatch_requests+0x139/0x190
> >  __blk_mq_run_hw_queue+0x12d/0x1c0
> >  blk_mq_run_work_fn+0xd/0x10
> >  process_one_work+0x206/0x6a0
> >  worker_thread+0x49/0x4a0
> >  kthread+0x107/0x140
> >  ret_from_fork+0x2e/0x40

So I am just wondering where the I/O in the above trace comes from,
and it feels like there might be another bug behind this issue.

Also, blk_mq_quiesce_queue() in elevator_switch_mq()
shouldn't be necessary.

Thanks,
Ming
Bart Van Assche April 28, 2017, 3:35 p.m. UTC | #3
On Fri, 2017-04-28 at 10:00 +0800, Ming Lei wrote:
> On Thu, Apr 27, 2017 at 08:54:32AM -0700, Bart Van Assche wrote:
> >  void blk_mq_stop_hw_queue(struct blk_mq_hw_ctx *hctx)
> >  {
> > -	cancel_work(&hctx->run_work);
> > -	cancel_delayed_work(&hctx->delay_work);
> > +	cancel_work_sync(&hctx->run_work);
> > +	cancel_delayed_work_sync(&hctx->delay_work);
> 
> Could you explain a bit why we need the sync version?

Because the purpose of this patch is to make blk_mq_quiesce_queue() wait for
all .queue_rq() calls.
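
For reference, the difference the _sync variants make comes from the generic
workqueue API (illustrative comments below, not taken from the patch):

	cancel_work(&hctx->run_work);		/* only removes a pending work item; an
						 * already-running __blk_mq_run_hw_queue()
						 * keeps going and may still call .queue_rq() */

	cancel_work_sync(&hctx->run_work);	/* also waits for a running instance to
						 * finish, so no .queue_rq() started from
						 * this work can still be in flight */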

> So I suggest unifying both .run_work and .delayed_run_work
> into one work item, just as Jens did in the following link:
> 
> 	http://marc.info/?t=149183989800010&r=1&w=2

That should be done after this patch is upstream, otherwise this
patch won't apply to the stable trees.

Bart.
Ming Lei April 28, 2017, 4:02 p.m. UTC | #4
On Fri, Apr 28, 2017 at 03:35:59PM +0000, Bart Van Assche wrote:
> On Fri, 2017-04-28 at 10:00 +0800, Ming Lei wrote:
> > On Thu, Apr 27, 2017 at 08:54:32AM -0700, Bart Van Assche wrote:
> > >  void blk_mq_stop_hw_queue(struct blk_mq_hw_ctx *hctx)
> > >  {
> > > -	cancel_work(&hctx->run_work);
> > > -	cancel_delayed_work(&hctx->delay_work);
> > > +	cancel_work_sync(&hctx->run_work);
> > > +	cancel_delayed_work_sync(&hctx->delay_work);
> > 
> > Could you explain a bit why we need the sync version?
> 
> Because the purpose of this patch is to make blk_mq_quiesce_queue() wait for
> all .queue_rq() calls.

If only .queue_rq() from this queue needs to be waited for, then
blk_mq_freeze_queue() is enough. If you want to wait for .queue_rq()
from all queues, blk_mq_quiesce_queue() can't do that either.

Firstly, there shouldn't be a 'struct request_queue' parameter passed to
this function, because you want to wait for .queue_rq() from all queues.

Secondly, your current implementation can't guarantee that point:
	- only the hw queues of this queue are stopped in this function, and
	.queue_rq() from other queues can still run once
	synchronize_rcu() returns.

	- for the BLOCKING case, the SRCU is per hctx (see the sketch below).
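
For reference, a rough sketch of a per-queue wait along those lines, assuming
the existing per-hctx srcu_struct (queue_rq_srcu) for BLK_MQ_F_BLOCKING
drivers and plain RCU for the rest (illustrative only, names may differ):

	void quiesce_queue_sketch(struct request_queue *q)
	{
		struct blk_mq_hw_ctx *hctx;
		unsigned int i;
		bool rcu = false;

		blk_mq_stop_hw_queues(q);	/* only this queue's hw queues */

		queue_for_each_hw_ctx(q, hctx, i) {
			if (hctx->flags & BLK_MQ_F_BLOCKING)
				synchronize_srcu(&hctx->queue_rq_srcu);
			else
				rcu = true;
		}
		if (rcu)
			synchronize_rcu();	/* covers non-blocking .queue_rq() only */
	}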

Thirdly, I am still not sure why we want to wait for all .queue_rq() calls;
could you explain it a bit?

At least for an mq scheduler switch, it looks like all scheduler data is
per request queue, and we shouldn't need to wait for all .queue_rq() calls.

> 
> > So I suggest unifying both .run_work and .delayed_run_work
> > into one work item, just as Jens did in the following link:
> > 
> > 	http://marc.info/?t=149183989800010&r=1&w=2
> 
> That should be done after this patch is upstream otherwise this
> patch won't apply to the stable trees.

OK.

Thanks,
Ming

Patch

diff --git a/block/blk-mq.c b/block/blk-mq.c
index b75ef2392db7..3b3420f76b5a 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1224,8 +1224,9 @@  EXPORT_SYMBOL(blk_mq_queue_stopped);
 
 void blk_mq_stop_hw_queue(struct blk_mq_hw_ctx *hctx)
 {
-	cancel_work(&hctx->run_work);
-	cancel_delayed_work(&hctx->delay_work);
+	cancel_work_sync(&hctx->run_work);
+	cancel_delayed_work_sync(&hctx->delay_work);
+	cancel_delayed_work_sync(&hctx->delayed_run_work);
 	set_bit(BLK_MQ_S_STOPPED, &hctx->state);
 }
 EXPORT_SYMBOL(blk_mq_stop_hw_queue);