diff mbox series

I/O errors for ALUA state transitions

Message ID 20210907071605.48968-1-hare@suse.de (mailing list archive)
State Changes Requested
Headers show
Series I/O errors for ALUA state transitions | expand

Commit Message

Hannes Reinecke Sept. 7, 2021, 7:16 a.m. UTC
From: Rajashekhar M A <rajs@netapp.com>

When a host is configured with a few LUNs and IO is running,
injecting FC faults repeatedly leads to path recovery problems.
The LUNs have 4 paths each and 3 of them come back active after
say an FC fault which makes two of the paths go down, instead of
all 4. This happens after several iterations of continuous FC faults.

Reason here is that we're returning an I/O error whenever we're
encountering sense code 06/04/0a (LOGICAL UNIT NOT ACCESSIBLE,
ASYMMETRIC ACCESS STATE TRANSITION) instead of retrying.

Signed-off-by: Hannes Reinecke <hare@suse.de>
---
 drivers/scsi/scsi_error.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

Comments

Mike Christie Sept. 7, 2021, 3:59 p.m. UTC | #1
On 9/7/21 2:16 AM, Hannes Reinecke wrote:
> From: Rajashekhar M A <rajs@netapp.com>
> 
> When a host is configured with a few LUNs and IO is running,
> injecting FC faults repeatedly leads to path recovery problems.
> The LUNs have 4 paths each and 3 of them come back active after
> say an FC fault which makes two of the paths go down, instead of
> all 4. This happens after several iterations of continuous FC faults.
> 
> Reason here is that we're returning an I/O error whenever we're
> encountering sense code 06/04/0a (LOGICAL UNIT NOT ACCESSIBLE,
> ASYMMETRIC ACCESS STATE TRANSITION) instead of retrying.
> > Signed-off-by: Hannes Reinecke <hare@suse.de>
> ---
>  drivers/scsi/scsi_error.c | 7 ++++---
>  1 file changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
> index 03a2ff547b22..1185083105ae 100644
> --- a/drivers/scsi/scsi_error.c
> +++ b/drivers/scsi/scsi_error.c
> @@ -594,10 +594,11 @@ enum scsi_disposition scsi_check_sense(struct scsi_cmnd *scmd)
>  		    sshdr.asc == 0x3f && sshdr.ascq == 0x0e)
>  			return NEEDS_RETRY;
>  		/*
> -		 * if the device is in the process of becoming ready, we
> -		 * should retry.
> +		 * if the device is in the process of becoming ready, or
> +		 * transitions between ALUA states, we should retry.
>  		 */
> -		if ((sshdr.asc == 0x04) && (sshdr.ascq == 0x01))
> +		if ((sshdr.asc == 0x04) &&
> +		    (sshdr.ascq == 0x01 || sshdr.ascq == 0x0a))
>  			return NEEDS_RETRY;

Why put this here instead of in alua_check_sense with the other ALUA
state transition check?
Hannes Reinecke Sept. 8, 2021, 6:12 a.m. UTC | #2
On 9/7/21 5:59 PM, michael.christie@oracle.com wrote:
> On 9/7/21 2:16 AM, Hannes Reinecke wrote:
>> From: Rajashekhar M A <rajs@netapp.com>
>>
>> When a host is configured with a few LUNs and IO is running,
>> injecting FC faults repeatedly leads to path recovery problems.
>> The LUNs have 4 paths each and 3 of them come back active after
>> say an FC fault which makes two of the paths go down, instead of
>> all 4. This happens after several iterations of continuous FC faults.
>>
>> Reason here is that we're returning an I/O error whenever we're
>> encountering sense code 06/04/0a (LOGICAL UNIT NOT ACCESSIBLE,
>> ASYMMETRIC ACCESS STATE TRANSITION) instead of retrying.
>>> Signed-off-by: Hannes Reinecke <hare@suse.de>
>> ---
>>   drivers/scsi/scsi_error.c | 7 ++++---
>>   1 file changed, 4 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
>> index 03a2ff547b22..1185083105ae 100644
>> --- a/drivers/scsi/scsi_error.c
>> +++ b/drivers/scsi/scsi_error.c
>> @@ -594,10 +594,11 @@ enum scsi_disposition scsi_check_sense(struct scsi_cmnd *scmd)
>>   		    sshdr.asc == 0x3f && sshdr.ascq == 0x0e)
>>   			return NEEDS_RETRY;
>>   		/*
>> -		 * if the device is in the process of becoming ready, we
>> -		 * should retry.
>> +		 * if the device is in the process of becoming ready, or
>> +		 * transitions between ALUA states, we should retry.
>>   		 */
>> -		if ((sshdr.asc == 0x04) && (sshdr.ascq == 0x01))
>> +		if ((sshdr.asc == 0x04) &&
>> +		    (sshdr.ascq == 0x01 || sshdr.ascq == 0x0a))
>>   			return NEEDS_RETRY;
> 
> Why put this here instead of in alua_check_sense with the other ALUA
> state transition check?
> 
Good point. Will be updating the patch.

Cheers,

Hannes
Ewan Milne Sept. 8, 2021, 9:30 p.m. UTC | #3
alua_check_sense() checks for NOT READY / 04h 0Ah but not
UNIT ATTENTION with the same ASC/ASCQ.

Careful about updating pg->state in this case though because that
may cause subsequent I/O in the probe to fail, we didn't do this before
and we *may* only get the UA once.

-Ewan

On Wed, Sep 8, 2021 at 2:12 AM Hannes Reinecke <hare@suse.de> wrote:
>
> On 9/7/21 5:59 PM, michael.christie@oracle.com wrote:
> > On 9/7/21 2:16 AM, Hannes Reinecke wrote:
> >> From: Rajashekhar M A <rajs@netapp.com>
> >>
> >> When a host is configured with a few LUNs and IO is running,
> >> injecting FC faults repeatedly leads to path recovery problems.
> >> The LUNs have 4 paths each and 3 of them come back active after
> >> say an FC fault which makes two of the paths go down, instead of
> >> all 4. This happens after several iterations of continuous FC faults.
> >>
> >> Reason here is that we're returning an I/O error whenever we're
> >> encountering sense code 06/04/0a (LOGICAL UNIT NOT ACCESSIBLE,
> >> ASYMMETRIC ACCESS STATE TRANSITION) instead of retrying.
> >>> Signed-off-by: Hannes Reinecke <hare@suse.de>
> >> ---
> >>   drivers/scsi/scsi_error.c | 7 ++++---
> >>   1 file changed, 4 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
> >> index 03a2ff547b22..1185083105ae 100644
> >> --- a/drivers/scsi/scsi_error.c
> >> +++ b/drivers/scsi/scsi_error.c
> >> @@ -594,10 +594,11 @@ enum scsi_disposition scsi_check_sense(struct scsi_cmnd *scmd)
> >>                  sshdr.asc == 0x3f && sshdr.ascq == 0x0e)
> >>                      return NEEDS_RETRY;
> >>              /*
> >> -             * if the device is in the process of becoming ready, we
> >> -             * should retry.
> >> +             * if the device is in the process of becoming ready, or
> >> +             * transitions between ALUA states, we should retry.
> >>               */
> >> -            if ((sshdr.asc == 0x04) && (sshdr.ascq == 0x01))
> >> +            if ((sshdr.asc == 0x04) &&
> >> +                (sshdr.ascq == 0x01 || sshdr.ascq == 0x0a))
> >>                      return NEEDS_RETRY;
> >
> > Why put this here instead of in alua_check_sense with the other ALUA
> > state transition check?
> >
> Good point. Will be updating the patch.
>
> Cheers,
>
> Hannes
> --
> Dr. Hannes Reinecke                Kernel Storage Architect
> hare@suse.de                              +49 911 74053 688
> SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
> HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer
>
diff mbox series

Patch

diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index 03a2ff547b22..1185083105ae 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -594,10 +594,11 @@  enum scsi_disposition scsi_check_sense(struct scsi_cmnd *scmd)
 		    sshdr.asc == 0x3f && sshdr.ascq == 0x0e)
 			return NEEDS_RETRY;
 		/*
-		 * if the device is in the process of becoming ready, we
-		 * should retry.
+		 * if the device is in the process of becoming ready, or
+		 * transitions between ALUA states, we should retry.
 		 */
-		if ((sshdr.asc == 0x04) && (sshdr.ascq == 0x01))
+		if ((sshdr.asc == 0x04) &&
+		    (sshdr.ascq == 0x01 || sshdr.ascq == 0x0a))
 			return NEEDS_RETRY;
 		/*
 		 * if the device is not started, we need to wake