diff mbox series

[2/2] RDMA/srp: Rework SCSI device reset handling

Message ID 20190117002717.84686-3-bvanassche@acm.org (mailing list archive)
State Accepted
Series Two SRP initiator bug fixes

Commit Message

Bart Van Assche Jan. 17, 2019, 12:27 a.m. UTC
Since .scsi_done() must only be called after scsi_queue_rq() has
finished, make sure that the SRP initiator driver does not call
.scsi_done() while scsi_queue_rq() is in progress. Although
invoking sg_reset -d while I/O is in progress works fine with kernel
v4.20 and before, that is not the case with kernel v5.0-rc1. This
patch prevents the following crash, which can be triggered with kernel
v5.0-rc1:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000138
CPU: 0 PID: 360 Comm: kworker/0:1H Tainted: G    B             5.0.0-rc1-dbg+ #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
Workqueue: kblockd blk_mq_run_work_fn
RIP: 0010:blk_mq_dispatch_rq_list+0x116/0xb10
Call Trace:
 blk_mq_sched_dispatch_requests+0x2f7/0x300
 __blk_mq_run_hw_queue+0xd6/0x180
 blk_mq_run_work_fn+0x27/0x30
 process_one_work+0x4f1/0xa20
 worker_thread+0x67/0x5b0
 kthread+0x1cf/0x1f0
 ret_from_fork+0x24/0x30

Cc: Sergey Gorenko <sergeygo@mellanox.com>
Cc: Max Gurtovoy <maxg@mellanox.com>
Cc: Laurence Oberman <loberman@redhat.com>
Cc: <stable@vger.kernel.org>
Fixes: 94a9174c630c ("IB/srp: reduce lock coverage of command completion") # v2.6.38
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 drivers/infiniband/ulp/srp/ib_srp.c | 20 +++++++++-----------
 1 file changed, 9 insertions(+), 11 deletions(-)

Comments

Christoph Hellwig Jan. 19, 2019, 10:04 a.m. UTC | #1
> +	/* Check whether all requests have finished. */
> +	blk_freeze_queue_start(q);
> +	time_left = blk_mq_freeze_queue_wait_timeout(q, 1 * HZ);
> +	blk_mq_unfreeze_queue(q);
>  
> +	return time_left > 0 ? SUCCESS : FAILED;

This is entirely generic SCSI/block level functionality.  I'd rather have
a new WAIT_FOR_FREEZE return value from ->eh_device_reset_handler and
handle this in the SCSI midlayer.
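
A rough sketch of the suggestion above, written as a userspace model rather than kernel code. It assumes the wait helper behaves like wait_event_timeout() (0 on timeout, otherwise the remaining time, at least 1). MODEL_WAIT_FOR_FREEZE, struct model_queue and all model_* functions are hypothetical names. No such return value exists in the SCSI midlayer as far as this thread shows.

```c
/*
 * Userspace model, not kernel code. MODEL_WAIT_FOR_FREEZE, struct model_queue
 * and every model_* function are hypothetical names used for illustration.
 */
enum model_result {
	MODEL_SUCCESS,
	MODEL_FAILED,
	MODEL_WAIT_FOR_FREEZE,	/* hypothetical new handler return value */
};

struct model_queue {
	int in_flight;		/* requests that have not completed yet */
	int done_per_tick;	/* requests completing per simulated tick */
};

typedef enum model_result (*model_reset_fn)(struct model_queue *q);

/*
 * Wait until the queue has drained or the timeout expires. Returns 0 on
 * timeout, otherwise the remaining ticks (at least 1).
 */
static long model_freeze_wait_timeout(struct model_queue *q, long timeout)
{
	while (q->in_flight > 0 && timeout > 0) {
		int done = q->done_per_tick < q->in_flight ?
			   q->done_per_tick : q->in_flight;

		q->in_flight -= done;
		timeout--;
	}
	if (q->in_flight > 0)
		return 0;
	return timeout > 0 ? timeout : 1;
}

/* Driver side: the reset succeeded, so ask the midlayer to do the waiting. */
static enum model_result model_lld_reset_ok(struct model_queue *q)
{
	(void)q;
	return MODEL_WAIT_FOR_FREEZE;
}

/* Driver side: the reset itself failed, so there is nothing to wait for. */
static enum model_result model_lld_reset_failed(struct model_queue *q)
{
	(void)q;
	return MODEL_FAILED;
}

/* Midlayer side: the generic wait lives here instead of in each driver. */
static enum model_result model_midlayer_device_reset(struct model_queue *q,
						     model_reset_fn handler,
						     long timeout)
{
	enum model_result rtn = handler(q);

	if (rtn != MODEL_WAIT_FOR_FREEZE)
		return rtn;
	return model_freeze_wait_timeout(q, timeout) > 0 ?
		MODEL_SUCCESS : MODEL_FAILED;
}
```

In this model a driver only reports whether the reset itself worked. The mapping from "queue drained in time" to a success or failure result is done once, in the midlayer function.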

Bart Van Assche Jan. 21, 2019, 9:08 p.m. UTC | #2
On 1/19/19 2:04 AM, Christoph Hellwig wrote:
>> +	/* Check whether all requests have finished. */
>> +	blk_freeze_queue_start(q);
>> +	time_left = blk_mq_freeze_queue_wait_timeout(q, 1 * HZ);
>> +	blk_mq_unfreeze_queue(q);
>>   
>> +	return time_left > 0 ? SUCCESS : FAILED;
> 
> This is entirely generic SCSI/block level functionality.  I'd rather have
> a new WAIT_FOR_FREEZE return value from ->eh_device_reset_handler and
> handle this in the SCSI midlayer.

Hi Christoph,

Since a SCSI device must only reply to a reset task management function 
after all affected commands have completed, the only case in which that 
wait code is useful is if a regular reply is sent concurrently with the 
SCSI reset reply and the two replies get reordered. Since the SCSI error 
handler is able to deal with pending commands after a device reset, how 
about leaving out the queue freeze / unfreeze code?

Thanks,

Bart.
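
The reordering case described above can be illustrated with a small userspace model. This is a sketch under stated assumptions, not driver code: enum model_reply and model_pending_at_reset() are hypothetical names, and replies are processed strictly in array order.

```c
/*
 * Userspace model, not kernel code. All names are hypothetical and exist
 * only to illustrate the reply-ordering argument.
 */
#include <stddef.h>

enum model_reply {
	MODEL_CMD_REPLY,	/* regular reply to an outstanding command */
	MODEL_TMF_REPLY,	/* reply to the reset task management function */
};

/*
 * Process replies in arrival order and report how many commands are still
 * outstanding at the moment the reset reply is seen.
 */
static int model_pending_at_reset(const enum model_reply *order, size_t n,
				  int outstanding)
{
	for (size_t i = 0; i < n; i++) {
		if (order[i] == MODEL_TMF_REPLY)
			return outstanding;
		if (outstanding > 0)
			outstanding--;
	}
	return outstanding;
}
```

With one outstanding command, the in-order sequence (command reply, then reset reply) leaves nothing pending when the reset reply arrives. Only the reordered sequence leaves a command pending, which is the case the wait code covers.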

Bart Van Assche Jan. 22, 2019, 4:04 p.m. UTC | #3
On Tue, 2019-01-22 at 15:55 +0000, Sasha Levin wrote:
> [This is an automated email]
> 
> This commit has been processed because it contains a "Fixes:" tag,
> fixing commit: 94a9174c630c IB/srp: reduce lock coverage of command completion.
> 
> The bot has tested the following trees: v4.20.3, v4.19.16, v4.14.94, v4.9.151, v4.4.171, v3.18.132.
> 
> v4.20.3: Build OK!
> v4.19.16: Build OK!
> v4.14.94: Build OK!
> v4.9.151: Build failed! Errors:
>     drivers/infiniband/ulp/srp/ib_srp.c:2657:2: error: implicit declaration of function ‘blk_freeze_queue_start’; did you mean ‘blk_mq_freeze_queue_start’? [-Werror=implicit-function-declaration]
>     drivers/infiniband/ulp/srp/ib_srp.c:2658:14: error: implicit declaration of function ‘blk_mq_freeze_queue_wait_timeout’; did you mean ‘blk_mq_freeze_queue_start’? [-Werror=implicit-function-declaration]
> 
> v4.4.171: Build failed! Errors:
>     drivers/infiniband/ulp/srp/ib_srp.c:2612:2: error: implicit declaration of function ‘blk_freeze_queue_start’; did you mean ‘blk_mq_freeze_queue_start’? [-Werror=implicit-function-declaration]
>     drivers/infiniband/ulp/srp/ib_srp.c:2613:14: error: implicit declaration of function ‘blk_mq_freeze_queue_wait_timeout’; did you mean ‘blk_mq_freeze_queue_start’? [-Werror=implicit-function-declaration]
> 
> v3.18.132: Failed to apply! Possible dependencies:
>     205619f2f824 ("IB/srp: Remove stale connection retry mechanism")
>     34aa654ecb8e ("IB/srp: Avoid that I/O hangs due to a cable pull during LUN scanning")
>     394c595ee8c3 ("IB/srp: Move ib_destroy_cm_id() call into srp_free_ch_ib()")
>     509c07bc1850 ("IB/srp: Separate target and channel variables")
>     747fe000ef38 ("IB/srp: Introduce two new srp_target_port member variables")
>     77f2c1a40e6f ("IB/srp: Use block layer tags")
>     d92c0da71a35 ("IB/srp: Add multichannel support")
> 
> 
> How should we proceed with this patch?

Hi Sasha,

Patch 2/2 does not have a "Cc: stable" tag because it definitely should NOT be
backported to older kernels. This patch only works for blk-mq which is fine with
kernel v5.0. Older kernels however support both the legacy block layer and blk-mq.

Bart.

Patch

diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
index 23e5c9afb8fb..f7ccbb07321b 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.c
+++ b/drivers/infiniband/ulp/srp/ib_srp.c
@@ -3036,9 +3036,11 @@  static int srp_abort(struct scsi_cmnd *scmnd)
 
 static int srp_reset_device(struct scsi_cmnd *scmnd)
 {
-	struct srp_target_port *target = host_to_target(scmnd->device->host);
+	struct scsi_device *sdev = scmnd->device;
+	struct srp_target_port *target = host_to_target(sdev->host);
 	struct srp_rdma_ch *ch;
-	int i, j;
+	struct request_queue *q = sdev->request_queue;
+	int time_left;
 	u8 status;
 
 	shost_printk(KERN_ERR, target->scsi_host, "SRP reset_device called\n");
@@ -3050,16 +3052,12 @@  static int srp_reset_device(struct scsi_cmnd *scmnd)
 	if (status)
 		return FAILED;
 
-	for (i = 0; i < target->ch_count; i++) {
-		ch = &target->ch[i];
-		for (j = 0; j < target->req_ring_size; ++j) {
-			struct srp_request *req = &ch->req_ring[j];
-
-			srp_finish_req(ch, req, scmnd->device, DID_RESET << 16);
-		}
-	}
+	/* Check whether all requests have finished. */
+	blk_freeze_queue_start(q);
+	time_left = blk_mq_freeze_queue_wait_timeout(q, 1 * HZ);
+	blk_mq_unfreeze_queue(q);
 
-	return SUCCESS;
+	return time_left > 0 ? SUCCESS : FAILED;
 }
 
 static int srp_reset_host(struct scsi_cmnd *scmnd)