
[1/2] blk-mq: don't complete un-started request in timeout handler

Message ID 1489064578-17305-3-git-send-email-tom.leiming@gmail.com (mailing list archive)
State New, archived

Commit Message

Ming Lei March 9, 2017, 1:02 p.m. UTC
When iterating over busy requests in the timeout handler, if the
STARTED flag of a request isn't set, the request is still being
processed inside the block layer and hasn't been dispatched to the
low-level driver yet.

In the current implementation of blk_mq_check_expired(), when the
request queue becomes dying, such un-started requests are completed
immediately. This can cause a use-after-free [1][2] when doing I/O
while removing and resetting an NVMe device.

This patch fixes several issues reported by Yi Zhang.
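
For reference, STARTED is only set once the request has been issued and its
deadline armed. The sketch below is abridged from blk_mq_start_request() in
kernels of this era (accounting, bidi and dma-drain details elided):

void blk_mq_start_request(struct request *rq)
{
	struct request_queue *q = rq->q;

	trace_block_rq_issue(q, rq);

	/* arm the timeout: this is what sets rq->deadline */
	blk_add_timer(rq);

	/*
	 * Ensure that ->deadline is visible before we set the started
	 * flag and clear the completed flag.
	 */
	smp_mb__before_atomic();

	/*
	 * Mark us as started and clear complete. Complete might have
	 * been set if requeue raced with timeout, which then marked it
	 * as complete. So be sure to clear complete again when we start
	 * the request, otherwise we'll ignore the completion event.
	 */
	if (!test_bit(REQ_ATOM_STARTED, &rq->atomic_flags))
		set_bit(REQ_ATOM_STARTED, &rq->atomic_flags);
	if (test_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags))
		clear_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags);
}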

[1]. oops log 1
[  581.789754] ------------[ cut here ]------------
[  581.789758] kernel BUG at block/blk-mq.c:374!
[  581.789760] invalid opcode: 0000 [#1] SMP
[  581.789761] Modules linked in: vfat fat ipmi_ssif intel_rapl sb_edac
edac_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm nvme
irqbypass crct10dif_pclmul nvme_core crc32_pclmul ghash_clmulni_intel
intel_cstate ipmi_si mei_me ipmi_devintf intel_uncore sg ipmi_msghandler
intel_rapl_perf iTCO_wdt mei iTCO_vendor_support mxm_wmi lpc_ich dcdbas shpchp
pcspkr acpi_power_meter wmi nfsd auth_rpcgss nfs_acl lockd dm_multipath grace
sunrpc ip_tables xfs libcrc32c sd_mod mgag200 i2c_algo_bit drm_kms_helper
syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm ahci libahci
crc32c_intel tg3 libata megaraid_sas i2c_core ptp fjes pps_core dm_mirror
dm_region_hash dm_log dm_mod
[  581.789796] CPU: 1 PID: 1617 Comm: kworker/1:1H Not tainted 4.10.0.bz1420297+ #4
[  581.789797] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.2.5 09/06/2016
[  581.789804] Workqueue: kblockd blk_mq_timeout_work
[  581.789806] task: ffff8804721c8000 task.stack: ffffc90006ee4000
[  581.789809] RIP: 0010:blk_mq_end_request+0x58/0x70
[  581.789810] RSP: 0018:ffffc90006ee7d50 EFLAGS: 00010202
[  581.789811] RAX: 0000000000000001 RBX: ffff8802e4195340 RCX: ffff88028e2f4b88
[  581.789812] RDX: 0000000000001000 RSI: 0000000000001000 RDI: 0000000000000000
[  581.789813] RBP: ffffc90006ee7d60 R08: 0000000000000003 R09: ffff88028e2f4b00
[  581.789814] R10: 0000000000001000 R11: 0000000000000001 R12: 00000000fffffffb
[  581.789815] R13: ffff88042abe5780 R14: 000000000000002d R15: ffff88046fbdff80
[  581.789817] FS:  0000000000000000(0000) GS:ffff88047fc00000(0000) knlGS:0000000000000000
[  581.789818] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  581.789819] CR2: 00007f64f403a008 CR3: 000000014d078000 CR4: 00000000001406e0
[  581.789820] Call Trace:
[  581.789825]  blk_mq_check_expired+0x76/0x80
[  581.789828]  bt_iter+0x45/0x50
[  581.789830]  blk_mq_queue_tag_busy_iter+0xdd/0x1f0
[  581.789832]  ? blk_mq_rq_timed_out+0x70/0x70
[  581.789833]  ? blk_mq_rq_timed_out+0x70/0x70
[  581.789840]  ? __switch_to+0x140/0x450
[  581.789841]  blk_mq_timeout_work+0x88/0x170
[  581.789845]  process_one_work+0x165/0x410
[  581.789847]  worker_thread+0x137/0x4c0
[  581.789851]  kthread+0x101/0x140
[  581.789853]  ? rescuer_thread+0x3b0/0x3b0
[  581.789855]  ? kthread_park+0x90/0x90
[  581.789860]  ret_from_fork+0x2c/0x40
[  581.789861] Code: 48 85 c0 74 0d 44 89 e6 48 89 df ff d0 5b 41 5c 5d c3 48
8b bb 70 01 00 00 48 85 ff 75 0f 48 89 df e8 7d f0 ff ff 5b 41 5c 5d c3 <0f>
0b e8 71 f0 ff ff 90 eb e9 0f 1f 40 00 66 2e 0f 1f 84 00 00
[  581.789882] RIP: blk_mq_end_request+0x58/0x70 RSP: ffffc90006ee7d50
[  581.789889] ---[ end trace bcaf03d9a14a0a70 ]---

[2]. oops log2
[ 6984.857362] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
[ 6984.857372] IP: nvme_queue_rq+0x6e6/0x8cd [nvme]
[ 6984.857373] PGD 0
[ 6984.857374]
[ 6984.857376] Oops: 0000 [#1] SMP
[ 6984.857379] Modules linked in: ipmi_ssif vfat fat intel_rapl sb_edac
edac_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm
irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel ipmi_si iTCO_wdt
iTCO_vendor_support mxm_wmi ipmi_devintf intel_cstate sg dcdbas intel_uncore
mei_me intel_rapl_perf mei pcspkr lpc_ich ipmi_msghandler shpchp
acpi_power_meter wmi nfsd auth_rpcgss dm_multipath nfs_acl lockd grace sunrpc
ip_tables xfs libcrc32c sd_mod mgag200 i2c_algo_bit drm_kms_helper syscopyarea
sysfillrect crc32c_intel sysimgblt fb_sys_fops ttm nvme drm nvme_core ahci
libahci i2c_core tg3 libata ptp megaraid_sas pps_core fjes dm_mirror
dm_region_hash dm_log dm_mod
[ 6984.857416] CPU: 7 PID: 1635 Comm: kworker/7:1H Not tainted
4.10.0-2.el7.bz1420297.x86_64 #1
[ 6984.857417] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.2.5 09/06/2016
[ 6984.857427] Workqueue: kblockd blk_mq_run_work_fn
[ 6984.857429] task: ffff880476e3da00 task.stack: ffffc90002e90000
[ 6984.857432] RIP: 0010:nvme_queue_rq+0x6e6/0x8cd [nvme]
[ 6984.857433] RSP: 0018:ffffc90002e93c50 EFLAGS: 00010246
[ 6984.857434] RAX: 0000000000000000 RBX: ffff880275646600 RCX: 0000000000001000
[ 6984.857435] RDX: 0000000000000fff RSI: 00000002fba2a000 RDI: ffff8804734e6950
[ 6984.857436] RBP: ffffc90002e93d30 R08: 0000000000002000 R09: 0000000000001000
[ 6984.857437] R10: 0000000000001000 R11: 0000000000000000 R12: ffff8804741d8000
[ 6984.857438] R13: 0000000000000040 R14: ffff880475649f80 R15: ffff8804734e6780
[ 6984.857439] FS:  0000000000000000(0000) GS:ffff88047fcc0000(0000) knlGS:0000000000000000
[ 6984.857440] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 6984.857442] CR2: 0000000000000010 CR3: 0000000001c09000 CR4: 00000000001406e0
[ 6984.857443] Call Trace:
[ 6984.857451]  ? mempool_free+0x2b/0x80
[ 6984.857455]  ? bio_free+0x4e/0x60
[ 6984.857459]  blk_mq_dispatch_rq_list+0xf5/0x230
[ 6984.857462]  blk_mq_process_rq_list+0x133/0x170
[ 6984.857465]  __blk_mq_run_hw_queue+0x8c/0xa0
[ 6984.857467]  blk_mq_run_work_fn+0x12/0x20
[ 6984.857473]  process_one_work+0x165/0x410
[ 6984.857475]  worker_thread+0x137/0x4c0
[ 6984.857478]  kthread+0x101/0x140
[ 6984.857480]  ? rescuer_thread+0x3b0/0x3b0
[ 6984.857481]  ? kthread_park+0x90/0x90
[ 6984.857489]  ret_from_fork+0x2c/0x40
[ 6984.857490] Code: 8b bd 70 ff ff ff 89 95 50 ff ff ff 89 8d 58 ff ff ff 44
89 95 60 ff ff ff e8 b7 dd 12 e1 8b 95 50 ff ff ff 48 89 85 68 ff ff ff <4c>
8b 48 10 44 8b 58 18 8b 8d 58 ff ff ff 44 8b 95 60 ff ff ff
[ 6984.857511] RIP: nvme_queue_rq+0x6e6/0x8cd [nvme] RSP: ffffc90002e93c50
[ 6984.857512] CR2: 0000000000000010
[ 6984.895359] ---[ end trace 2d7ceb528432bf83 ]---

Cc: stable@vger.kernel.org
Reported-by: Yi Zhang <yizhan@redhat.com>
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
---
 block/blk-mq.c | 11 +----------
 1 file changed, 1 insertion(+), 10 deletions(-)

Comments

Bart Van Assche March 15, 2017, 12:07 a.m. UTC | #1
On Thu, 2017-03-09 at 21:02 +0800, Ming Lei wrote:
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 159187a28d66..0aff380099d5 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -697,17 +697,8 @@ static void blk_mq_check_expired(struct blk_mq_hw_ctx *hctx,
>  {
>  	struct blk_mq_timeout_data *data = priv;
>  
> -	if (!test_bit(REQ_ATOM_STARTED, &rq->atomic_flags)) {
> -		/*
> -		 * If a request wasn't started before the queue was
> -		 * marked dying, kill it here or it'll go unnoticed.
> -		 */
> -		if (unlikely(blk_queue_dying(rq->q))) {
> -			rq->errors = -EIO;
> -			blk_mq_end_request(rq, rq->errors);
> -		}
> +	if (!test_bit(REQ_ATOM_STARTED, &rq->atomic_flags))
>  		return;
> -	}
>  
>  	if (time_after_eq(jiffies, rq->deadline)) {
>  		if (!blk_mark_rq_complete(rq))

Both the old and the new check look racy to me. The REQ_ATOM_STARTED bit can
be changed concurrently by blk_mq_start_request(), __blk_mq_finish_request()
or __blk_mq_requeue_request(). Another issue with this function is that the
request passed to this function can be reinitialized concurrently. Sorry but
I'm not sure what the best way is to address these issues.

Bart.
Ming Lei March 15, 2017, 12:18 p.m. UTC | #2
On Wed, Mar 15, 2017 at 12:07:37AM +0000, Bart Van Assche wrote:
> On Thu, 2017-03-09 at 21:02 +0800, Ming Lei wrote:
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index 159187a28d66..0aff380099d5 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -697,17 +697,8 @@ static void blk_mq_check_expired(struct blk_mq_hw_ctx *hctx,
> >  {
> >  	struct blk_mq_timeout_data *data = priv;
> >  
> > -	if (!test_bit(REQ_ATOM_STARTED, &rq->atomic_flags)) {
> > -		/*
> > -		 * If a request wasn't started before the queue was
> > -		 * marked dying, kill it here or it'll go unnoticed.
> > -		 */
> > -		if (unlikely(blk_queue_dying(rq->q))) {
> > -			rq->errors = -EIO;
> > -			blk_mq_end_request(rq, rq->errors);
> > -		}
> > +	if (!test_bit(REQ_ATOM_STARTED, &rq->atomic_flags))
> >  		return;
> > -	}
> >  
> >  	if (time_after_eq(jiffies, rq->deadline)) {
> >  		if (!blk_mark_rq_complete(rq))
> 
> Both the old and the new check look racy to me. The REQ_ATOM_STARTED bit can
> be changed concurrently by blk_mq_start_request(), __blk_mq_finish_request()

blk_mq_start_request() and __blk_mq_finish_request() won't be run concurrently.

From the viewpoint of __blk_mq_finish_request():

	- if it is run from the merge-queue I/O path (blk_mq_merge_queue_io()),
	blk_mq_start_request() can't have been run at all, and the COMPLETE
	flag keeps its previous value (zero)

	- if it is run from the normal completion path, the COMPLETE flag is
	cleared before the req/tag is released to the tag set.

so there isn't a race between blk_mq_start_request() and
__blk_mq_finish_request() wrt. the timeout handler.

> or __blk_mq_requeue_request(). Another issue with this function is that the

__blk_mq_requeue_request() can be run from two paths:

	- dispatch failure, in which case the req/tag isn't released to the
	tag set

	- the I/O completion path, in which case the COMPLETE flag is cleared
	before the requeue.

so I can't see a race with the timeout handler between starting and
requeuing a request.
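
(For reference, an abridged sketch of the function in question from kernels
of this era; wbt/stat hooks elided:)

static void __blk_mq_requeue_request(struct request *rq)
{
	struct request_queue *q = rq->q;

	trace_block_rq_requeue(q, rq);

	if (test_and_clear_bit(REQ_ATOM_STARTED, &rq->atomic_flags)) {
		if (q->dma_drain_size && blk_rq_bytes(rq))
			rq->nr_phys_segments--;
	}
}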

> request passed to this function can be reinitialized concurrently. Sorry but

Yes, that is possible, but rq->atomic_flags is preserved, and in that case,
when we handle a timeout, COMPLETE is cleared before the rq/tag is released
to the tag set via blk_mark_rq_complete(), so we won't complete that request
twice.
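
(blk_mark_rq_complete() is the atomic claim that both the timeout handler
and the completion path go through; from block/blk.h:)

/*
 * EH timer and IO completion will both attempt to 'grab' the request, make
 * sure that only one of them succeeds.
 */
static inline int blk_mark_rq_complete(struct request *rq)
{
	return test_and_set_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags);
}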



Thanks,
Ming
Ming Lei March 15, 2017, 12:40 p.m. UTC | #3
On Wed, Mar 15, 2017 at 08:18:53PM +0800, Ming Lei wrote:
> On Wed, Mar 15, 2017 at 12:07:37AM +0000, Bart Van Assche wrote:
> 
> > or __blk_mq_requeue_request(). Another issue with this function is that the
> 
> __blk_mq_requeue_request() can be run from two paths:
> 
> 	- dispatch failure, in which case the req/tag isn't released to the
> 	tag set
> 
> 	- the I/O completion path, in which case the COMPLETE flag is cleared
> 	before the requeue.
> 
> so I can't see a race with the timeout handler between starting and
> requeuing a request.

Actually the rq/tag won't be released to the tag set if it is going to be
requeued, so the timeout race has nothing to do with requeue.

Thanks,
Ming
Bart Van Assche March 15, 2017, 3:36 p.m. UTC | #4
On Wed, 2017-03-15 at 20:40 +0800, Ming Lei wrote:
> On Wed, Mar 15, 2017 at 08:18:53PM +0800, Ming Lei wrote:
> > On Wed, Mar 15, 2017 at 12:07:37AM +0000, Bart Van Assche wrote:
> > 
> > > or __blk_mq_requeue_request(). Another issue with this function is that the
> > 
> > __blk_mq_requeue_request() can be run from two paths:
> > 
> > 	- dispatch failure, in which case the req/tag isn't released to the
> > 	tag set
> > 
> > 	- the I/O completion path, in which case the COMPLETE flag is cleared
> > 	before the requeue.
> > 
> > so I can't see a race with the timeout handler between starting and
> > requeuing a request.
> 
> Actually the rq/tag won't be released to the tag set if it is going to be
> requeued, so the timeout race has nothing to do with requeue.

Hello Ming,

Please have another look at __blk_mq_requeue_request(). In that function
the following code occurs:

	if (test_and_clear_bit(REQ_ATOM_STARTED, &rq->atomic_flags)) {
		...
	}

I think the REQ_ATOM_STARTED check in blk_mq_check_expired() races with the
test_and_clear_bit(REQ_ATOM_STARTED, &rq->atomic_flags) call in
__blk_mq_requeue_request().
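
One interleaving that would illustrate this (hypothetical timeline):

	CPU 0 (timeout worker)                 CPU 1 (dispatch path)
	----------------------                 ---------------------
	blk_mq_check_expired()
	  test_bit(REQ_ATOM_STARTED) -> true
	                                       __blk_mq_requeue_request()
	                                         test_and_clear_bit(REQ_ATOM_STARTED)
	  time_after_eq(jiffies, rq->deadline)
	    -> may act on a request the
	       driver no longer owns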

Bart.
Ming Lei March 15, 2017, 4:22 p.m. UTC | #5
On Wed, Mar 15, 2017 at 03:36:31PM +0000, Bart Van Assche wrote:
> On Wed, 2017-03-15 at 20:40 +0800, Ming Lei wrote:
> > On Wed, Mar 15, 2017 at 08:18:53PM +0800, Ming Lei wrote:
> > > On Wed, Mar 15, 2017 at 12:07:37AM +0000, Bart Van Assche wrote:
> > > 
> > > > or __blk_mq_requeue_request(). Another issue with this function is that the
> > > 
> > > __blk_mq_requeue_request() can be run from two paths:
> > > 
> > > 	- dispatch failure, in which case the req/tag isn't released to the
> > > 	tag set
> > > 
> > > 	- the I/O completion path, in which case the COMPLETE flag is cleared
> > > 	before the requeue.
> > > 
> > > so I can't see a race with the timeout handler between starting and
> > > requeuing a request.
> > 
> > Actually the rq/tag won't be released to the tag set if it is going to be
> > requeued, so the timeout race has nothing to do with requeue.
> 
> Hello Ming,
> 
> Please have another look at __blk_mq_requeue_request(). In that function
> the following code occurs: if (test_and_clear_bit(REQ_ATOM_STARTED,
> &rq->atomic_flags)) { ... }
> 
> I think the REQ_ATOM_STARTED check in blk_mq_check_expired() races with the
> test_and_clear_bit(REQ_ATOM_STARTED, &rq->atomic_flags) call in
> __blk_mq_requeue_request().

OK, this race can only exist when the requeue happens after a busy dispatch,
because the COMPLETE flag isn't set then. If the requeue comes from I/O
completion, there is no such race because the COMPLETE flag is set.

One solution I thought of is to call blk_mark_rq_complete() before requeuing
when a busy dispatch happens, but that looks a bit silly. Another way is to
set the STARTED flag just after .queue_rq returns BLK_MQ_RQ_QUEUE_OK, which
looks reasonable too. Any comments on the second solution?
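
For concreteness, a rough sketch of the second option against the dispatch
path (hypothetical, not part of this patch):

	ret = q->mq_ops->queue_rq(hctx, &bd);
	switch (ret) {
	case BLK_MQ_RQ_QUEUE_OK:
		/* only now expose STARTED to the timeout handler */
		set_bit(REQ_ATOM_STARTED, &rq->atomic_flags);
		queued++;
		break;
	case BLK_MQ_RQ_QUEUE_BUSY:
		/* put the request back on the dispatch list; STARTED was
		 * never set, so blk_mq_check_expired() ignores it */
		list_add(&rq->queuelist, list);
		break;
	default:
		rq->errors = -EIO;
		blk_mq_end_request(rq, rq->errors);
		break;
	}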


Thanks,
Ming
Ming Lei March 15, 2017, 4:46 p.m. UTC | #6
On Thu, Mar 16, 2017 at 12:22:01AM +0800, Ming Lei wrote:
> On Wed, Mar 15, 2017 at 03:36:31PM +0000, Bart Van Assche wrote:
> 
> OK, this race can only exist when the requeue happens after a busy dispatch,
> because the COMPLETE flag isn't set then. If the requeue comes from I/O
> completion, there is no such race because the COMPLETE flag is set.
> 
> One solution I thought of is to call blk_mark_rq_complete() before requeuing
> when a busy dispatch happens, but that looks a bit silly. Another way is to
> set the STARTED flag just after .queue_rq returns BLK_MQ_RQ_QUEUE_OK, which
> looks reasonable too. Any comments on the second solution?

Actually this can't happen, because rq->deadline is only set in
blk_mq_start_request(), which is called from .queue_rq, and it won't trigger
timeout handling even if STARTED is observed as true in blk_mq_check_expired(),
because the timeout period is usually set large enough. So it is still safe,
isn't it?

But this situation should have been commented.
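
Something along these lines, for illustration:

	/*
	 * The STARTED check may race with __blk_mq_requeue_request()
	 * clearing the bit after a busy dispatch. That is harmless:
	 * rq->deadline was set by blk_mq_start_request() only moments
	 * ago, so time_after_eq(jiffies, rq->deadline) cannot be true
	 * yet and the request won't be timed out spuriously.
	 */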

Thanks,
Ming
Bart Van Assche March 15, 2017, 9:34 p.m. UTC | #7
On Wed, 2017-03-15 at 20:18 +0800, Ming Lei wrote:
> On Wed, Mar 15, 2017 at 12:07:37AM +0000, Bart Van Assche wrote:
> > Both the old and the new check look racy to me. The REQ_ATOM_STARTED bit can
> > be changed concurrently by blk_mq_start_request(), __blk_mq_finish_request()
> 
> blk_mq_start_request() and __blk_mq_finish_request() won't be run concurrently.
> 
> From the viewpoint of __blk_mq_finish_request():
> 
> 	- if it is run from the merge-queue I/O path (blk_mq_merge_queue_io()),
> 	blk_mq_start_request() can't have been run at all, and the COMPLETE
> 	flag keeps its previous value (zero)
> 
> 	- if it is run from the normal completion path, the COMPLETE flag is
> 	cleared before the req/tag is released to the tag set.
> 
> so there isn't a race between blk_mq_start_request() and
> __blk_mq_finish_request() wrt. the timeout handler.
> 
> > or __blk_mq_requeue_request(). Another issue with this function is that the
> 
> __blk_mq_requeue_request() can be run from two paths:
> 
> 	- dispatch failure, in which case the req/tag isn't released to the
> 	tag set
> 
> 	- the I/O completion path, in which case the COMPLETE flag is cleared
> 	before the requeue.
> 
> so I can't see a race with the timeout handler between starting and
> requeuing a request.
> 
> > request passed to this function can be reinitialized concurrently.

Hello Ming,

You misinterpreted what I wrote. I was referring to manipulation of
REQ_ATOM_STARTED from different contexts, not to what you explained.

Bart.
Ming Lei March 15, 2017, 11:41 p.m. UTC | #8
On Wed, Mar 15, 2017 at 09:34:50PM +0000, Bart Van Assche wrote:
> On Wed, 2017-03-15 at 20:18 +0800, Ming Lei wrote:
> > On Wed, Mar 15, 2017 at 12:07:37AM +0000, Bart Van Assche wrote:
> > > Both the old and the new check look racy to me. The REQ_ATOM_STARTED bit can
> > > be changed concurrently by blk_mq_start_request(), __blk_mq_finish_request()
> > 
> > blk_mq_start_request() and __blk_mq_finish_request() won't be run concurrently.
> > 
> > From the viewpoint of __blk_mq_finish_request():
> > 
> > 	- if it is run from the merge-queue I/O path (blk_mq_merge_queue_io()),
> > 	blk_mq_start_request() can't have been run at all, and the COMPLETE
> > 	flag keeps its previous value (zero)
> > 
> > 	- if it is run from the normal completion path, the COMPLETE flag is
> > 	cleared before the req/tag is released to the tag set.
> > 
> > so there isn't a race between blk_mq_start_request() and
> > __blk_mq_finish_request() wrt. the timeout handler.
> > 
> > > or __blk_mq_requeue_request(). Another issue with this function is that the
> > 
> > __blk_mq_requeue_request() can be run from two paths:
> > 
> > 	- dispatch failure, in which case the req/tag isn't released to the
> > 	tag set
> > 
> > 	- the I/O completion path, in which case the COMPLETE flag is cleared
> > 	before the requeue.
> > 
> > so I can't see a race with the timeout handler between starting and
> > requeuing a request.
> > 
> > > request passed to this function can be reinitialized concurrently.
> 
> Hello Ming,
> 
> You misinterpret what I wrote. I was referring to manipulation of
> REQ_ATOM_STARTED from different contexts and not to what you explained.

This patch addresses one race, between the timeout handler and I/O that is
still being queued in the block layer (i.e. before .queue_rq), so please
focus on this patch. That part is definitely correct.

Also, I am happy to discuss this kind of race further, but maybe we can do
that in another thread if you can describe the issue clearly.
Bart Van Assche March 16, 2017, 9:37 p.m. UTC | #9
On Thu, 2017-03-09 at 21:02 +0800, Ming Lei wrote:
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 159187a28d66..0aff380099d5 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -697,17 +697,8 @@ static void blk_mq_check_expired(struct blk_mq_hw_ctx *hctx,
>  {
>  	struct blk_mq_timeout_data *data = priv;
>  
> -	if (!test_bit(REQ_ATOM_STARTED, &rq->atomic_flags)) {
> -		/*
> -		 * If a request wasn't started before the queue was
> -		 * marked dying, kill it here or it'll go unnoticed.
> -		 */
> -		if (unlikely(blk_queue_dying(rq->q))) {
> -			rq->errors = -EIO;
> -			blk_mq_end_request(rq, rq->errors);
> -		}
> +	if (!test_bit(REQ_ATOM_STARTED, &rq->atomic_flags))
>  		return;
> -	}
>  
>  	if (time_after_eq(jiffies, rq->deadline)) {
>  		if (!blk_mark_rq_complete(rq))

Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>

Patch

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 159187a28d66..0aff380099d5 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -697,17 +697,8 @@  static void blk_mq_check_expired(struct blk_mq_hw_ctx *hctx,
 {
 	struct blk_mq_timeout_data *data = priv;
 
-	if (!test_bit(REQ_ATOM_STARTED, &rq->atomic_flags)) {
-		/*
-		 * If a request wasn't started before the queue was
-		 * marked dying, kill it here or it'll go unnoticed.
-		 */
-		if (unlikely(blk_queue_dying(rq->q))) {
-			rq->errors = -EIO;
-			blk_mq_end_request(rq, rq->errors);
-		}
+	if (!test_bit(REQ_ATOM_STARTED, &rq->atomic_flags))
 		return;
-	}
 
 	if (time_after_eq(jiffies, rq->deadline)) {
 		if (!blk_mark_rq_complete(rq))
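
With the hunk applied, the function reads as below; the tail of the function
is reconstructed from the surrounding v4.10 source and shown only for
reference:

static void blk_mq_check_expired(struct blk_mq_hw_ctx *hctx,
		struct request *rq, void *priv, bool reserved)
{
	struct blk_mq_timeout_data *data = priv;

	/*
	 * Un-started requests are still owned by the submitting context;
	 * the timeout handler must leave them alone.
	 */
	if (!test_bit(REQ_ATOM_STARTED, &rq->atomic_flags))
		return;

	if (time_after_eq(jiffies, rq->deadline)) {
		if (!blk_mark_rq_complete(rq))
			blk_mq_rq_timed_out(rq, reserved);
	} else if (!data->next_set || time_after(data->next, rq->deadline)) {
		data->next = rq->deadline;
		data->next_set = 1;
	}
}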