diff mbox series

[RESEND] scsi: qla2xxx: I/Os timing out on surprise removal of

Message ID 1539703391-29633-1-git-send-email-William.Kuzeja@stratus.com (mailing list archive)
State Changes Requested
Headers show
Series [RESEND] scsi: qla2xxx: I/Os timing out on surprise removal of | expand

Commit Message

Bill Kuzeja Oct. 16, 2018, 3:23 p.m. UTC
When doing a surprise removal of an adapter, some in flight I/Os can get 
stuck and take a while to complete (they actually timeout and are 
retried). We are not handling an early error exit from 
qla2xxx_eh_abort properly.

Fixes: 45235022da99 ("scsi: qla2xxx: Fix driver unload by shutting down chip") 
Signed-off-by: Bill Kuzeja <william.kuzeja@stratus.com>
---

(Originally sent on September 26th, no reply so resending.)

After a hot remove of a Qlogic adapter, the driver's remove function gets 
called and we end up aborting all in progress I/Os. Here is the code flow:

qla2x00_remove_one
  qla2x00_abort_isp_cleanup
    qla2x00_abort_all_cmds
      __qla2x00_abort_all_cmds
        qla2xxx_eh_abort

At the start of qla2xxx_eh_abort, some sanity checks are done before 
actually sending the abort. One of these checks is a call to 
fc_block_scsi_eh. In the case of a hot remove, it turns out that this 
routine can exit with FAST_IO_FAIL.

When this occurs, we return back to __qla2x00_abort_all_cmds with an 
extra reference on sp (because the abort never gets sent). Originally, 
this was addressed with another fix:

commit 4cd3b6ebff85 scsi: qla2xxx: Fix extraneous ref on sp's after adapter break

But this later this added change complicated matters:

commit 45235022da99 scsi: qla2xxx: Fix driver unload by shutting down chip

Because the abort is now being done earlier in the teardown (through 
qla2x00_abort_isp_cleanup), in qla2xxx_eh_abort we make it past 
the first check because qla2x00_isp_reg_stat(ha) returns zero. When we
fail a few lines later in fc_block_scsi_eh, this error is not handled
properly in __qla2x00_abort_all_cmds and the I/O ends up hanging and 
timing out because of the extra reference.

For this fix, a check for FAST_IO_FAIL is added to 
__qla2x00_abort_all_cmds where we check to see if qla2xxx_eh_abort 
succeeded or not. 

This removes the extra reference in this additional early exit case. In 
my testing (hw surprise removals and also adapter remove via sysfs), this 
eliminates the timeouts and delays and the remove proceeds smoothly.

---
 drivers/scsi/qla2xxx/qla_os.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

Comments

Laurence Oberman Oct. 16, 2018, 4:17 p.m. UTC | #1
On Tue, 2018-10-16 at 11:23 -0400, Bill Kuzeja wrote:
> When doing a surprise removal of an adapter, some in flight I/Os can
> get 
> stuck and take a while to complete (they actually timeout and are 
> retried). We are not handling an early error exit from 
> qla2xxx_eh_abort properly.
> 
> Fixes: 45235022da99 ("scsi: qla2xxx: Fix driver unload by shutting
> down chip") 
> Signed-off-by: Bill Kuzeja <william.kuzeja@stratus.com>
> ---
> 
> (Originally sent on September 26th, no reply so resending.)
> 
> After a hot remove of a Qlogic adapter, the driver's remove function
> gets 
> called and we end up aborting all in progress I/Os. Here is the code
> flow:
> 
> qla2x00_remove_one
>   qla2x00_abort_isp_cleanup
>     qla2x00_abort_all_cmds
>       __qla2x00_abort_all_cmds
>         qla2xxx_eh_abort
> 
> At the start of qla2xxx_eh_abort, some sanity checks are done before 
> actually sending the abort. One of these checks is a call to 
> fc_block_scsi_eh. In the case of a hot remove, it turns out that
> this 
> routine can exit with FAST_IO_FAIL.
> 
> When this occurs, we return back to __qla2x00_abort_all_cmds with an 
> extra reference on sp (because the abort never gets sent).
> Originally, 
> this was addressed with another fix:
> 
> commit 4cd3b6ebff85 scsi: qla2xxx: Fix extraneous ref on sp's after
> adapter break
> 
> But this later this added change complicated matters:
> 
> commit 45235022da99 scsi: qla2xxx: Fix driver unload by shutting down
> chip
> 
> Because the abort is now being done earlier in the teardown (through 
> qla2x00_abort_isp_cleanup), in qla2xxx_eh_abort we make it past 
> the first check because qla2x00_isp_reg_stat(ha) returns zero. When
> we
> fail a few lines later in fc_block_scsi_eh, this error is not handled
> properly in __qla2x00_abort_all_cmds and the I/O ends up hanging and 
> timing out because of the extra reference.
> 
> For this fix, a check for FAST_IO_FAIL is added to 
> __qla2x00_abort_all_cmds where we check to see if qla2xxx_eh_abort 
> succeeded or not. 
> 
> This removes the extra reference in this additional early exit case.
> In 
> my testing (hw surprise removals and also adapter remove via sysfs),
> this 
> eliminates the timeouts and delays and the remove proceeds smoothly.
> 
> ---
>  drivers/scsi/qla2xxx/qla_os.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/scsi/qla2xxx/qla_os.c
> b/drivers/scsi/qla2xxx/qla_os.c
> index 42b8f0d..3ba3765 100644
> --- a/drivers/scsi/qla2xxx/qla_os.c
> +++ b/drivers/scsi/qla2xxx/qla_os.c
> @@ -1771,8 +1771,9 @@ uint32_t qla2x00_isp_reg_stat(struct
> qla_hw_data *ha)
>  					 * if immediate exit from
>  					 * ql2xxx_eh_abort
>  					 */
> -					if (status == FAILED &&
> -					    (qla2x00_isp_reg_stat(ha
> )))
> +					if (((status == FAILED) &&
> +					    (qla2x00_isp_reg_stat(ha
> ))) ||
> +					     (status ==
> FAST_IO_FAIL))
>  						atomic_dec(
>  						    &sp->ref_count);
>  				}

I recently bumped into this issue. The fix looks correct to me.

Reviewed-by Laurence Oberman <loberman@redhat.com>
Martin K. Petersen Oct. 18, 2018, 12:51 a.m. UTC | #2
Himanshu,

> When doing a surprise removal of an adapter, some in flight I/Os can get 
> stuck and take a while to complete (they actually timeout and are 
> retried). We are not handling an early error exit from 
> qla2xxx_eh_abort properly.
>
> Fixes: 45235022da99 ("scsi: qla2xxx: Fix driver unload by shutting down chip") 
> Signed-off-by: Bill Kuzeja <william.kuzeja@stratus.com>

Please review. Thanks!
Madhani, Himanshu Oct. 18, 2018, 9:30 p.m. UTC | #3
Martin, Bill, 

> On Oct 17, 2018, at 5:51 PM, Martin K. Petersen <martin.petersen@oracle.com> wrote:
> 
> External Email
> 
> Himanshu,
> 
>> When doing a surprise removal of an adapter, some in flight I/Os can get
>> stuck and take a while to complete (they actually timeout and are
>> retried). We are not handling an early error exit from
>> qla2xxx_eh_abort properly.
>> 
>> Fixes: 45235022da99 ("scsi: qla2xxx: Fix driver unload by shutting down chip")
>> Signed-off-by: Bill Kuzeja <william.kuzeja@stratus.com>
> 
> Please review. Thanks!
> 
> --
> Martin K. Petersen      Oracle Linux Engineering

Thanks for the patience. 

This patch Looks good to me

Acked-by: Himanshu Madhani <himanshu.madhani@cavium.com>

Thanks,
- Himanshu
Martin K. Petersen Oct. 19, 2018, 10:24 p.m. UTC | #4
Bill,

> When doing a surprise removal of an adapter, some in flight I/Os can
> get stuck and take a while to complete (they actually timeout and are
> retried). We are not handling an early error exit from
> qla2xxx_eh_abort properly.

This doesn't apply to 4.20/scsi-queue and the surrounding code has
changed significantly.
Bill Kuzeja Oct. 23, 2018, 5:59 p.m. UTC | #5
This is still a bug in 4.20/scsi-queue. I am sending a new patch that applies cleanly to 
4.20/scsi-queue as it stands now. I have tested it successfully.

Regards

  -Bill

-----Original Message-----
From: Martin K. Petersen [mailto:martin.petersen@oracle.com] 
Sent: Friday, October 19, 2018 6:24 PM
To: Kuzeja, William <william.kuzeja@stratus.com>
Cc: linux-scsi@vger.kernel.org; qla2xxx-upstream@qlogic.com
Subject: Re: [PATCH RESEND] scsi: qla2xxx: I/Os timing out on surprise removal of


Bill,

> When doing a surprise removal of an adapter, some in flight I/Os can
> get stuck and take a while to complete (they actually timeout and are
> retried). We are not handling an early error exit from
> qla2xxx_eh_abort properly.

This doesn't apply to 4.20/scsi-queue and the surrounding code has
changed significantly.
diff mbox series

Patch

diff --git a/drivers/scsi/qla2xxx/qla_os.c b/drivers/scsi/qla2xxx/qla_os.c
index 42b8f0d..3ba3765 100644
--- a/drivers/scsi/qla2xxx/qla_os.c
+++ b/drivers/scsi/qla2xxx/qla_os.c
@@ -1771,8 +1771,9 @@  uint32_t qla2x00_isp_reg_stat(struct qla_hw_data *ha)
 					 * if immediate exit from
 					 * ql2xxx_eh_abort
 					 */
-					if (status == FAILED &&
-					    (qla2x00_isp_reg_stat(ha)))
+					if (((status == FAILED) &&
+					    (qla2x00_isp_reg_stat(ha))) ||
+					     (status == FAST_IO_FAIL))
 						atomic_dec(
 						    &sp->ref_count);
 				}