diff mbox series

[2/2] scsi: Fix hang when device state is set via sysfs

Message ID 20211105221048.6541-3-michael.christie@oracle.com (mailing list archive)
State Accepted
Headers show
Series scsi: fix hang when device state is set via sysfs | expand

Commit Message

Mike Christie Nov. 5, 2021, 10:10 p.m. UTC
This fixes a regression added with:

commit f0f82e2476f6 ("scsi: core: Fix capacity set to zero after
offlinining device")

The problem is that after iSCSI recovery, iscsid will call into the kernel
to set the dev's state to running, and with that patch we now call
scsi_rescan_device with the state_mutex held. If the scsi error handler
thread is just starting to test the device in scsi_send_eh_cmnd then it's
going to try to grab the state_mutex.

We are then stuck, because when scsi_rescan_device tries to send its IO
scsi_queue_rq calls -> scsi_host_queue_ready -> scsi_host_in_recovery
which will return true (the host state is still in recovery) and IO will
just be requeued. scsi_send_eh_cmnd will then never be able to grab the
state_mutex to finish error handling.

To prevent the deadlock this moves the rescan related code to after we
drop the state_mutex.

This also adds a check for if we are already in the running state. This
prevents extra scans and helps the iscsid case where if the transport
class has already onlined the device during it's recovery process then we
don't need userspace to do it again plus possibly block that daemon.

Fixes: f0f82e2476f6 ("scsi: core: Fix capacity set to zero after
offlinining device")
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: lijinlin <lijinlin3@huawei.com>
Cc: Wu Bo <wubo40@huawei.com>
Signed-off-by: Mike Christie <michael.christie@oracle.com>
---
 drivers/scsi/scsi_sysfs.c | 30 +++++++++++++++++++-----------
 1 file changed, 19 insertions(+), 11 deletions(-)

Comments

Lee Duncan Nov. 6, 2021, 6:17 p.m. UTC | #1
On 11/5/21 3:10 PM, Mike Christie wrote:
> This fixes a regression added with:
> 
> commit f0f82e2476f6 ("scsi: core: Fix capacity set to zero after
> offlinining device")
> 
> The problem is that after iSCSI recovery, iscsid will call into the kernel
> to set the dev's state to running, and with that patch we now call
> scsi_rescan_device with the state_mutex held. If the scsi error handler
> thread is just starting to test the device in scsi_send_eh_cmnd then it's
> going to try to grab the state_mutex.
> 
> We are then stuck, because when scsi_rescan_device tries to send its IO
> scsi_queue_rq calls -> scsi_host_queue_ready -> scsi_host_in_recovery
> which will return true (the host state is still in recovery) and IO will
> just be requeued. scsi_send_eh_cmnd will then never be able to grab the
> state_mutex to finish error handling.
> 
> To prevent the deadlock this moves the rescan related code to after we
> drop the state_mutex.
> 
> This also adds a check for if we are already in the running state. This
> prevents extra scans and helps the iscsid case where if the transport
> class has already onlined the device during it's recovery process then we
> don't need userspace to do it again plus possibly block that daemon.
> 
> Fixes: f0f82e2476f6 ("scsi: core: Fix capacity set to zero after
> offlinining device")
> Cc: Bart Van Assche <bvanassche@acm.org>
> Cc: lijinlin <lijinlin3@huawei.com>
> Cc: Wu Bo <wubo40@huawei.com>
> Signed-off-by: Mike Christie <michael.christie@oracle.com>
> ---
>  drivers/scsi/scsi_sysfs.c | 30 +++++++++++++++++++-----------
>  1 file changed, 19 insertions(+), 11 deletions(-)
> 
> diff --git a/drivers/scsi/scsi_sysfs.c b/drivers/scsi/scsi_sysfs.c
> index a35841b34bfd..53e23a7bc0d3 100644
> --- a/drivers/scsi/scsi_sysfs.c
> +++ b/drivers/scsi/scsi_sysfs.c
> @@ -797,6 +797,7 @@ store_state_field(struct device *dev, struct device_attribute *attr,
>  	int i, ret;
>  	struct scsi_device *sdev = to_scsi_device(dev);
>  	enum scsi_device_state state = 0;
> +	bool rescan_dev = false;
>  
>  	for (i = 0; i < ARRAY_SIZE(sdev_states); i++) {
>  		const int len = strlen(sdev_states[i].name);
> @@ -815,20 +816,27 @@ store_state_field(struct device *dev, struct device_attribute *attr,
>  	}
>  
>  	mutex_lock(&sdev->state_mutex);
> -	ret = scsi_device_set_state(sdev, state);
> -	/*
> -	 * If the device state changes to SDEV_RUNNING, we need to
> -	 * run the queue to avoid I/O hang, and rescan the device
> -	 * to revalidate it. Running the queue first is necessary
> -	 * because another thread may be waiting inside
> -	 * blk_mq_freeze_queue_wait() and because that call may be
> -	 * waiting for pending I/O to finish.
> -	 */
> -	if (ret == 0 && state == SDEV_RUNNING) {
> +	if (sdev->sdev_state == SDEV_RUNNING && state == SDEV_RUNNING) {
> +		ret = count;
> +	} else {
> +		ret = scsi_device_set_state(sdev, state);
> +		if (ret == 0 && state == SDEV_RUNNING)
> +			rescan_dev = true;
> +	}
> +	mutex_unlock(&sdev->state_mutex);
> +
> +	if (rescan_dev) {
> +		/*
> +		 * If the device state changes to SDEV_RUNNING, we need to
> +		 * run the queue to avoid I/O hang, and rescan the device
> +		 * to revalidate it. Running the queue first is necessary
> +		 * because another thread may be waiting inside
> +		 * blk_mq_freeze_queue_wait() and because that call may be
> +		 * waiting for pending I/O to finish.
> +		 */
>  		blk_mq_run_hw_queues(sdev->request_queue, true);
>  		scsi_rescan_device(dev);
>  	}
> -	mutex_unlock(&sdev->state_mutex);
>  
>  	return ret == 0 ? count : -EINVAL;
>  }
> 

Reviewed-by: Lee Duncan <lduncan@suse.com>
Wu Bo Nov. 8, 2021, 1:50 a.m. UTC | #2
On 2021/11/6 6:10, Mike Christie wrote:
> This fixes a regression added with:
> 
> commit f0f82e2476f6 ("scsi: core: Fix capacity set to zero after
> offlinining device")
> 
> The problem is that after iSCSI recovery, iscsid will call into the kernel
> to set the dev's state to running, and with that patch we now call
> scsi_rescan_device with the state_mutex held. If the scsi error handler
> thread is just starting to test the device in scsi_send_eh_cmnd then it's
> going to try to grab the state_mutex.
> 
> We are then stuck, because when scsi_rescan_device tries to send its IO
> scsi_queue_rq calls -> scsi_host_queue_ready -> scsi_host_in_recovery
> which will return true (the host state is still in recovery) and IO will
> just be requeued. scsi_send_eh_cmnd will then never be able to grab the
> state_mutex to finish error handling.
> 
> To prevent the deadlock this moves the rescan related code to after we
> drop the state_mutex.
> 
> This also adds a check for if we are already in the running state. This
> prevents extra scans and helps the iscsid case where if the transport
> class has already onlined the device during it's recovery process then we
> don't need userspace to do it again plus possibly block that daemon.
> 
> Fixes: f0f82e2476f6 ("scsi: core: Fix capacity set to zero after
> offlinining device")
> Cc: Bart Van Assche <bvanassche@acm.org>
> Cc: lijinlin <lijinlin3@huawei.com>
> Cc: Wu Bo <wubo40@huawei.com>
> Signed-off-by: Mike Christie <michael.christie@oracle.com>
> ---
>   drivers/scsi/scsi_sysfs.c | 30 +++++++++++++++++++-----------
>   1 file changed, 19 insertions(+), 11 deletions(-)
> 
> diff --git a/drivers/scsi/scsi_sysfs.c b/drivers/scsi/scsi_sysfs.c
> index a35841b34bfd..53e23a7bc0d3 100644
> --- a/drivers/scsi/scsi_sysfs.c
> +++ b/drivers/scsi/scsi_sysfs.c
> @@ -797,6 +797,7 @@ store_state_field(struct device *dev, struct device_attribute *attr,
>   	int i, ret;
>   	struct scsi_device *sdev = to_scsi_device(dev);
>   	enum scsi_device_state state = 0;
> +	bool rescan_dev = false;
>   
>   	for (i = 0; i < ARRAY_SIZE(sdev_states); i++) {
>   		const int len = strlen(sdev_states[i].name);
> @@ -815,20 +816,27 @@ store_state_field(struct device *dev, struct device_attribute *attr,
>   	}
>   
>   	mutex_lock(&sdev->state_mutex);
> -	ret = scsi_device_set_state(sdev, state);
> -	/*
> -	 * If the device state changes to SDEV_RUNNING, we need to
> -	 * run the queue to avoid I/O hang, and rescan the device
> -	 * to revalidate it. Running the queue first is necessary
> -	 * because another thread may be waiting inside
> -	 * blk_mq_freeze_queue_wait() and because that call may be
> -	 * waiting for pending I/O to finish.
> -	 */
> -	if (ret == 0 && state == SDEV_RUNNING) {
> +	if (sdev->sdev_state == SDEV_RUNNING && state == SDEV_RUNNING) {
> +		ret = count;
> +	} else {
> +		ret = scsi_device_set_state(sdev, state);
> +		if (ret == 0 && state == SDEV_RUNNING)
> +			rescan_dev = true;
> +	}
> +	mutex_unlock(&sdev->state_mutex);
> +
> +	if (rescan_dev) {
> +		/*
> +		 * If the device state changes to SDEV_RUNNING, we need to
> +		 * run the queue to avoid I/O hang, and rescan the device
> +		 * to revalidate it. Running the queue first is necessary
> +		 * because another thread may be waiting inside
> +		 * blk_mq_freeze_queue_wait() and because that call may be
> +		 * waiting for pending I/O to finish.
> +		 */
>   		blk_mq_run_hw_queues(sdev->request_queue, true);
>   		scsi_rescan_device(dev);
>   	}
> -	mutex_unlock(&sdev->state_mutex);
>   
>   	return ret == 0 ? count : -EINVAL;
>   }
> 

Reviewed-by: Wu Bo <wubo40@huawei.com>
James Bottomley Nov. 19, 2021, 1:35 p.m. UTC | #3
On Fri, 2021-11-05 at 17:10 -0500, Mike Christie wrote:
> This fixes a regression added with:
> 
> commit f0f82e2476f6 ("scsi: core: Fix capacity set to zero after
> offlinining device")
> 
> The problem is that after iSCSI recovery, iscsid will call into the
> kernel
> to set the dev's state to running, and with that patch we now call
> scsi_rescan_device with the state_mutex held. If the scsi error
> handler
> thread is just starting to test the device in scsi_send_eh_cmnd then
> it's
> going to try to grab the state_mutex.
> 
> We are then stuck, because when scsi_rescan_device tries to send its
> IO
> scsi_queue_rq calls -> scsi_host_queue_ready -> scsi_host_in_recovery
> which will return true (the host state is still in recovery) and IO
> will
> just be requeued. scsi_send_eh_cmnd will then never be able to grab
> the
> state_mutex to finish error handling.
> 
> To prevent the deadlock this moves the rescan related code to after
> we
> drop the state_mutex.
> 
> This also adds a check for if we are already in the running state.
> This
> prevents extra scans and helps the iscsid case where if the transport
> class has already onlined the device during it's recovery process
> then we
> don't need userspace to do it again plus possibly block that daemon.
> 
> Fixes: f0f82e2476f6 ("scsi: core: Fix capacity set to zero after
> offlinining device")
> Cc: Bart Van Assche <bvanassche@acm.org>
> Cc: lijinlin <lijinlin3@huawei.com>
> Cc: Wu Bo <wubo40@huawei.com>
> Signed-off-by: Mike Christie <michael.christie@oracle.com>
> ---
>  drivers/scsi/scsi_sysfs.c | 30 +++++++++++++++++++-----------
>  1 file changed, 19 insertions(+), 11 deletions(-)
> 
> diff --git a/drivers/scsi/scsi_sysfs.c b/drivers/scsi/scsi_sysfs.c
> index a35841b34bfd..53e23a7bc0d3 100644
> --- a/drivers/scsi/scsi_sysfs.c
> +++ b/drivers/scsi/scsi_sysfs.c
> @@ -797,6 +797,7 @@ store_state_field(struct device *dev, struct
> device_attribute *attr,
>  	int i, ret;
>  	struct scsi_device *sdev = to_scsi_device(dev);
>  	enum scsi_device_state state = 0;
> +	bool rescan_dev = false;
>  
>  	for (i = 0; i < ARRAY_SIZE(sdev_states); i++) {
>  		const int len = strlen(sdev_states[i].name);
> @@ -815,20 +816,27 @@ store_state_field(struct device *dev, struct
> device_attribute *attr,
>  	}
>  
>  	mutex_lock(&sdev->state_mutex);
> -	ret = scsi_device_set_state(sdev, state);
> -	/*
> -	 * If the device state changes to SDEV_RUNNING, we need to
> -	 * run the queue to avoid I/O hang, and rescan the device
> -	 * to revalidate it. Running the queue first is necessary
> -	 * because another thread may be waiting inside
> -	 * blk_mq_freeze_queue_wait() and because that call may be
> -	 * waiting for pending I/O to finish.
> -	 */
> -	if (ret == 0 && state == SDEV_RUNNING) {
> +	if (sdev->sdev_state == SDEV_RUNNING && state == SDEV_RUNNING)
> {
> +		ret = count;

This looks wrong because of this

[...]
>  	return ret == 0 ? count : -EINVAL;

Don't we now return EINVAL on idempotent set state running because the
count is always non-zero?

I think the first statement should be 'ret = 0;' instead to cause
idempotent state setting to succeed as a nop.

James
Mike Christie Nov. 19, 2021, 5:13 p.m. UTC | #4
On 11/19/21 7:35 AM, James Bottomley wrote:
> On Fri, 2021-11-05 at 17:10 -0500, Mike Christie wrote:
>> This fixes a regression added with:
>>
>> commit f0f82e2476f6 ("scsi: core: Fix capacity set to zero after
>> offlinining device")
>>
>> The problem is that after iSCSI recovery, iscsid will call into the
>> kernel
>> to set the dev's state to running, and with that patch we now call
>> scsi_rescan_device with the state_mutex held. If the scsi error
>> handler
>> thread is just starting to test the device in scsi_send_eh_cmnd then
>> it's
>> going to try to grab the state_mutex.
>>
>> We are then stuck, because when scsi_rescan_device tries to send its
>> IO
>> scsi_queue_rq calls -> scsi_host_queue_ready -> scsi_host_in_recovery
>> which will return true (the host state is still in recovery) and IO
>> will
>> just be requeued. scsi_send_eh_cmnd will then never be able to grab
>> the
>> state_mutex to finish error handling.
>>
>> To prevent the deadlock this moves the rescan related code to after
>> we
>> drop the state_mutex.
>>
>> This also adds a check for if we are already in the running state.
>> This
>> prevents extra scans and helps the iscsid case where if the transport
>> class has already onlined the device during it's recovery process
>> then we
>> don't need userspace to do it again plus possibly block that daemon.
>>
>> Fixes: f0f82e2476f6 ("scsi: core: Fix capacity set to zero after
>> offlinining device")
>> Cc: Bart Van Assche <bvanassche@acm.org>
>> Cc: lijinlin <lijinlin3@huawei.com>
>> Cc: Wu Bo <wubo40@huawei.com>
>> Signed-off-by: Mike Christie <michael.christie@oracle.com>
>> ---
>>  drivers/scsi/scsi_sysfs.c | 30 +++++++++++++++++++-----------
>>  1 file changed, 19 insertions(+), 11 deletions(-)
>>
>> diff --git a/drivers/scsi/scsi_sysfs.c b/drivers/scsi/scsi_sysfs.c
>> index a35841b34bfd..53e23a7bc0d3 100644
>> --- a/drivers/scsi/scsi_sysfs.c
>> +++ b/drivers/scsi/scsi_sysfs.c
>> @@ -797,6 +797,7 @@ store_state_field(struct device *dev, struct
>> device_attribute *attr,
>>  	int i, ret;
>>  	struct scsi_device *sdev = to_scsi_device(dev);
>>  	enum scsi_device_state state = 0;
>> +	bool rescan_dev = false;
>>  
>>  	for (i = 0; i < ARRAY_SIZE(sdev_states); i++) {
>>  		const int len = strlen(sdev_states[i].name);
>> @@ -815,20 +816,27 @@ store_state_field(struct device *dev, struct
>> device_attribute *attr,
>>  	}
>>  
>>  	mutex_lock(&sdev->state_mutex);
>> -	ret = scsi_device_set_state(sdev, state);
>> -	/*
>> -	 * If the device state changes to SDEV_RUNNING, we need to
>> -	 * run the queue to avoid I/O hang, and rescan the device
>> -	 * to revalidate it. Running the queue first is necessary
>> -	 * because another thread may be waiting inside
>> -	 * blk_mq_freeze_queue_wait() and because that call may be
>> -	 * waiting for pending I/O to finish.
>> -	 */
>> -	if (ret == 0 && state == SDEV_RUNNING) {
>> +	if (sdev->sdev_state == SDEV_RUNNING && state == SDEV_RUNNING)
>> {
>> +		ret = count;
> 
> This looks wrong because of this
> 
> [...]
>>  	return ret == 0 ? count : -EINVAL;
> 
> Don't we now return EINVAL on idempotent set state running because the
> count is always non-zero?
> 
> I think the first statement should be 'ret = 0;' instead to cause
> idempotent state setting to succeed as a nop.
> 

You're right.

I'll resend this patchset with James's comment handled.
James Bottomley Nov. 19, 2021, 6:19 p.m. UTC | #5
On Fri, 2021-11-19 at 11:13 -0600, Mike Christie wrote:
> On 11/19/21 7:35 AM, James Bottomley wrote:
> > On Fri, 2021-11-05 at 17:10 -0500, Mike Christie wrote:
> > > This fixes a regression added with:
> > > 
> > > commit f0f82e2476f6 ("scsi: core: Fix capacity set to zero after
> > > offlinining device")
> > > 
> > > The problem is that after iSCSI recovery, iscsid will call into
> > > the kernel to set the dev's state to running, and with that patch
> > > we now call scsi_rescan_device with the state_mutex held. If the
> > > scsi error handler thread is just starting to test the device in
> > > scsi_send_eh_cmnd then it's going to try to grab the state_mutex.
> > > 
> > > We are then stuck, because when scsi_rescan_device tries to send
> > > its IO scsi_queue_rq calls -> scsi_host_queue_ready ->
> > > scsi_host_in_recovery which will return true (the host state is
> > > still in recovery) and IO will just be requeued.
> > > scsi_send_eh_cmnd will then never be able to grab the state_mutex
> > > to finish error handling.
> > > 
> > > To prevent the deadlock this moves the rescan related code to
> > > after we drop the state_mutex.
> > > 
> > > This also adds a check for if we are already in the running
> > > state. This prevents extra scans and helps the iscsid case where
> > > if the transport class has already onlined the device during it's
> > > recovery process then we don't need userspace to do it again plus
> > > possibly block that daemon.
> > > 
> > > Fixes: f0f82e2476f6 ("scsi: core: Fix capacity set to zero after
> > > offlinining device")
> > > Cc: Bart Van Assche <bvanassche@acm.org>
> > > Cc: lijinlin <lijinlin3@huawei.com>
> > > Cc: Wu Bo <wubo40@huawei.com>
> > > Signed-off-by: Mike Christie <michael.christie@oracle.com>
> > > ---
> > >  drivers/scsi/scsi_sysfs.c | 30 +++++++++++++++++++-----------
> > >  1 file changed, 19 insertions(+), 11 deletions(-)
> > > 
> > > diff --git a/drivers/scsi/scsi_sysfs.c
> > > b/drivers/scsi/scsi_sysfs.c
> > > index a35841b34bfd..53e23a7bc0d3 100644
> > > --- a/drivers/scsi/scsi_sysfs.c
> > > +++ b/drivers/scsi/scsi_sysfs.c
> > > @@ -797,6 +797,7 @@ store_state_field(struct device *dev, struct
> > > device_attribute *attr,
> > >  	int i, ret;
> > >  	struct scsi_device *sdev = to_scsi_device(dev);
> > >  	enum scsi_device_state state = 0;
> > > +	bool rescan_dev = false;
> > >  
> > >  	for (i = 0; i < ARRAY_SIZE(sdev_states); i++) {
> > >  		const int len = strlen(sdev_states[i].name);
> > > @@ -815,20 +816,27 @@ store_state_field(struct device *dev,
> > > struct
> > > device_attribute *attr,
> > >  	}
> > >  
> > >  	mutex_lock(&sdev->state_mutex);
> > > -	ret = scsi_device_set_state(sdev, state);
> > > -	/*
> > > -	 * If the device state changes to SDEV_RUNNING, we need to
> > > -	 * run the queue to avoid I/O hang, and rescan the device
> > > -	 * to revalidate it. Running the queue first is necessary
> > > -	 * because another thread may be waiting inside
> > > -	 * blk_mq_freeze_queue_wait() and because that call may be
> > > -	 * waiting for pending I/O to finish.
> > > -	 */
> > > -	if (ret == 0 && state == SDEV_RUNNING) {
> > > +	if (sdev->sdev_state == SDEV_RUNNING && state == SDEV_RUNNING)
> > > {
> > > +		ret = count;
> > 
> > This looks wrong because of this
> > 
> > [...]
> > >  	return ret == 0 ? count : -EINVAL;
> > 
> > Don't we now return EINVAL on idempotent set state running because
> > the count is always non-zero?
> > 
> > I think the first statement should be 'ret = 0;' instead to cause
> > idempotent state setting to succeed as a nop.
> > 
> 
> You're right.
> 
> I'll resend this patchset with James's comment handled.

Send an incremental patch because this is already in the queue for
Linus and he'd go ballistic if we had to rebase to remove it.

James
diff mbox series

Patch

diff --git a/drivers/scsi/scsi_sysfs.c b/drivers/scsi/scsi_sysfs.c
index a35841b34bfd..53e23a7bc0d3 100644
--- a/drivers/scsi/scsi_sysfs.c
+++ b/drivers/scsi/scsi_sysfs.c
@@ -797,6 +797,7 @@  store_state_field(struct device *dev, struct device_attribute *attr,
 	int i, ret;
 	struct scsi_device *sdev = to_scsi_device(dev);
 	enum scsi_device_state state = 0;
+	bool rescan_dev = false;
 
 	for (i = 0; i < ARRAY_SIZE(sdev_states); i++) {
 		const int len = strlen(sdev_states[i].name);
@@ -815,20 +816,27 @@  store_state_field(struct device *dev, struct device_attribute *attr,
 	}
 
 	mutex_lock(&sdev->state_mutex);
-	ret = scsi_device_set_state(sdev, state);
-	/*
-	 * If the device state changes to SDEV_RUNNING, we need to
-	 * run the queue to avoid I/O hang, and rescan the device
-	 * to revalidate it. Running the queue first is necessary
-	 * because another thread may be waiting inside
-	 * blk_mq_freeze_queue_wait() and because that call may be
-	 * waiting for pending I/O to finish.
-	 */
-	if (ret == 0 && state == SDEV_RUNNING) {
+	if (sdev->sdev_state == SDEV_RUNNING && state == SDEV_RUNNING) {
+		ret = count;
+	} else {
+		ret = scsi_device_set_state(sdev, state);
+		if (ret == 0 && state == SDEV_RUNNING)
+			rescan_dev = true;
+	}
+	mutex_unlock(&sdev->state_mutex);
+
+	if (rescan_dev) {
+		/*
+		 * If the device state changes to SDEV_RUNNING, we need to
+		 * run the queue to avoid I/O hang, and rescan the device
+		 * to revalidate it. Running the queue first is necessary
+		 * because another thread may be waiting inside
+		 * blk_mq_freeze_queue_wait() and because that call may be
+		 * waiting for pending I/O to finish.
+		 */
 		blk_mq_run_hw_queues(sdev->request_queue, true);
 		scsi_rescan_device(dev);
 	}
-	mutex_unlock(&sdev->state_mutex);
 
 	return ret == 0 ? count : -EINVAL;
 }