diff mbox series

[v2,1/3] scsi: qla2xxx: avoid sending mailbox commands if firmware is stopped

Message ID 20200205214422.3657-2-mwilck@suse.com (mailing list archive)
State Changes Requested
Headers show
Series scsi: qla2xxx: fixes for driver unloading | expand

Commit Message

Martin Wilck Feb. 5, 2020, 9:44 p.m. UTC
From: Martin Wilck <mwilck@suse.com>

Since commit 45235022da99 ("scsi: qla2xxx: Fix driver unload by shutting down chip"),
it is possible that FC commands are scheduled after the adapter firmware
has been shut down. IO sent to the firmware in this situation hangs
indefinitely. Avoid this for the LOGO code path that is typically taken
when adapters are shut down.

Fixes: 45235022da99 ("scsi: qla2xxx: Fix driver unload by shutting down chip")
Signed-off-by: Martin Wilck <mwilck@suse.com>
Reviewed-by: Roman Bolshakov <r.bolshakov@yadro.com>
---
 drivers/scsi/qla2xxx/qla_mbx.c | 3 +++
 drivers/scsi/qla2xxx/qla_os.c  | 3 +++
 2 files changed, 6 insertions(+)

Comments

Daniel Wagner Feb. 6, 2020, 12:33 p.m. UTC | #1
Hi Martin,

On Wed, Feb 05, 2020 at 10:44:20PM +0100, mwilck@suse.com wrote:
> From: Martin Wilck <mwilck@suse.com>
> 
> Since commit 45235022da99 ("scsi: qla2xxx: Fix driver unload by shutting down chip"),
> it is possible that FC commands are scheduled after the adapter firmware
> has been shut down. IO sent to the firmware in this situation hangs
> indefinitely. Avoid this for the LOGO code path that is typically taken
> when adapters are shut down.
> 
> Fixes: 45235022da99 ("scsi: qla2xxx: Fix driver unload by shutting down chip")
> Signed-off-by: Martin Wilck <mwilck@suse.com>
> Reviewed-by: Roman Bolshakov <r.bolshakov@yadro.com>

Just to understand it correctly: 45235022da99 ("scsi: qla2xxx: Fix
driver unload by shutting down chip") is saying FW is not able to
shutdown propably and therefore we kill it first and still try to do
some cleanup.

Are you sure you got all the necessary places fixed up?

I tend to say we should add

	if (!ha->flags.fw_started)
		return QLA_FUNCTION_FAILED;

do qla2x00_mailbox_command() and handle the errors one by one. Just an
idea.

Thanks,
Daniel
Martin Wilck Feb. 6, 2020, 12:50 p.m. UTC | #2
On Thu, 2020-02-06 at 13:33 +0100, Daniel Wagner wrote:
> Hi Martin,
> 
> On Wed, Feb 05, 2020 at 10:44:20PM +0100, mwilck@suse.com wrote:
> > From: Martin Wilck <mwilck@suse.com>
> > 
> > Since commit 45235022da99 ("scsi: qla2xxx: Fix driver unload by
> > shutting down chip"),
> > it is possible that FC commands are scheduled after the adapter
> > firmware
> > has been shut down. IO sent to the firmware in this situation hangs
> > indefinitely. Avoid this for the LOGO code path that is typically
> > taken
> > when adapters are shut down.
> > 
> > Fixes: 45235022da99 ("scsi: qla2xxx: Fix driver unload by shutting
> > down chip")
> > Signed-off-by: Martin Wilck <mwilck@suse.com>
> > Reviewed-by: Roman Bolshakov <r.bolshakov@yadro.com>
> 
> Just to understand it correctly: 45235022da99 ("scsi: qla2xxx: Fix
> driver unload by shutting down chip") is saying FW is not able to
> shutdown propably and therefore we kill it first and still try to do
> some cleanup.

Yes, that's what I observed. Mailbox access hangs in this case, as the
mailbox simply won't execute the command and the expected status change
will not happen.

> Are you sure you got all the necessary places fixed up?
> 
> I tend to say we should add
> 
> 	if (!ha->flags.fw_started)
> 		return QLA_FUNCTION_FAILED;
> 
> do qla2x00_mailbox_command() and handle the errors one by one. Just
> an
> idea.

I had that idea too. I even tried it out. IIRC it's not that easy. Some
commands need to be sent to the mailbox even in shut-down state
(MBC_EXECUTE_FIRMWARE, for example?). I admit I didn't dare going down
the path you're suggesting, I thought it had quite a potential to cause
regressions.

Did you mean this as a negative review, or rather an additional
suggestion?

Thanks
Martin
Daniel Wagner Feb. 6, 2020, 1:03 p.m. UTC | #3
On Thu, Feb 06, 2020 at 01:50:32PM +0100, Martin Wilck wrote:
> On Thu, 2020-02-06 at 13:33 +0100, Daniel Wagner wrote:
> > Are you sure you got all the necessary places fixed up?
> > 
> > I tend to say we should add
> > 
> > 	if (!ha->flags.fw_started)
> > 		return QLA_FUNCTION_FAILED;
> > 
> > do qla2x00_mailbox_command() and handle the errors one by one. Just
> > an
> > idea.
> 
> I had that idea too. I even tried it out. IIRC it's not that easy. Some
> commands need to be sent to the mailbox even in shut-down state
> (MBC_EXECUTE_FIRMWARE, for example?). I admit I didn't dare going down
> the path you're suggesting, I thought it had quite a potential to cause
> regressions.

Yes, this certainly causing regressions. This patch is using the
blacklist approach which is probably introducing less regression.

But I think for finding all the right spots the white listing approach
is better. Maybe only during development phase. Just an idea.

> Did you mean this as a negative review, or rather an additional
> suggestion?

This patch clearly improves things. So yes, I approve.

Reviewed-by: Daniel Wagner <dwagner@suse.de>

Thanks,
Daniel
Arun Easi March 24, 2020, 11:51 p.m. UTC | #4
On Wed, 5 Feb 2020, 1:44pm, mwilck@suse.com wrote:

> From: Martin Wilck <mwilck@suse.com>
> 
> Since commit 45235022da99 ("scsi: qla2xxx: Fix driver unload by shutting down chip"),
> it is possible that FC commands are scheduled after the adapter firmware
> has been shut down. IO sent to the firmware in this situation hangs
> indefinitely. Avoid this for the LOGO code path that is typically taken
> when adapters are shut down.
> 
> Fixes: 45235022da99 ("scsi: qla2xxx: Fix driver unload by shutting down chip")
> Signed-off-by: Martin Wilck <mwilck@suse.com>
> Reviewed-by: Roman Bolshakov <r.bolshakov@yadro.com>
> ---
>  drivers/scsi/qla2xxx/qla_mbx.c | 3 +++
>  drivers/scsi/qla2xxx/qla_os.c  | 3 +++
>  2 files changed, 6 insertions(+)
> 
> diff --git a/drivers/scsi/qla2xxx/qla_mbx.c b/drivers/scsi/qla2xxx/qla_mbx.c
> index 9e09964..53129f2 100644
> --- a/drivers/scsi/qla2xxx/qla_mbx.c
> +++ b/drivers/scsi/qla2xxx/qla_mbx.c
> @@ -2644,6 +2644,9 @@ qla24xx_fabric_logout(scsi_qla_host_t *vha, uint16_t loop_id, uint8_t domain,
>  	ql_dbg(ql_dbg_mbx + ql_dbg_verbose, vha, 0x106d,
>  	    "Entered %s.\n", __func__);
>  
> +	if (!ha->flags.fw_started)
> +		return QLA_FUNCTION_FAILED;
> +

Ok.

>  	lg = dma_pool_zalloc(ha->s_dma_pool, GFP_KERNEL, &lg_dma);
>  	if (lg == NULL) {
>  		ql_log(ql_log_warn, vha, 0x106e,
> diff --git a/drivers/scsi/qla2xxx/qla_os.c b/drivers/scsi/qla2xxx/qla_os.c
> index b520a98..2dafb46 100644
> --- a/drivers/scsi/qla2xxx/qla_os.c
> +++ b/drivers/scsi/qla2xxx/qla_os.c
> @@ -4878,6 +4878,9 @@ qla2x00_post_work(struct scsi_qla_host *vha, struct qla_work_evt *e)
>  	unsigned long flags;
>  	bool q = false;
>  
> +	if (!vha->hw->flags.fw_started)
> +		return QLA_FUNCTION_FAILED;
> +

I'd probably not do it here; rather in qla2x00_get_sp() 
/qla2x00_start_sp()/ qla2xxx_get_qpair_sp() time. Not all works are for 
posting to firmware (QLA_EVT_IDC_ACK, QLA_EVT_UNMAP etc.).

Regards,
-Arun

>  	spin_lock_irqsave(&vha->work_lock, flags);
>  	list_add_tail(&e->list, &vha->work_list);
>  
>
Martin Wilck March 25, 2020, 4:28 p.m. UTC | #5
On Tue, 2020-03-24 at 16:51 -0700, Arun Easi wrote:
> On Wed, 5 Feb 2020, 1:44pm, mwilck@suse.com wrote:
> 
> > From: Martin Wilck <mwilck@suse.com>
> > 
> > Since commit 45235022da99 ("scsi: qla2xxx: Fix driver unload by
> > shutting down chip"),
> > it is possible that FC commands are scheduled after the adapter
> > firmware
> > has been shut down. IO sent to the firmware in this situation hangs
> > indefinitely. Avoid this for the LOGO code path that is typically
> > taken
> > when adapters are shut down.
> > 
> > Fixes: 45235022da99 ("scsi: qla2xxx: Fix driver unload by shutting
> > down chip")
> > Signed-off-by: Martin Wilck <mwilck@suse.com>
> > Reviewed-by: Roman Bolshakov <r.bolshakov@yadro.com>
> > ---
> >  drivers/scsi/qla2xxx/qla_mbx.c | 3 +++
> >  drivers/scsi/qla2xxx/qla_os.c  | 3 +++
> >  2 files changed, 6 insertions(+)
> > 
> > 
> > --- a/drivers/scsi/qla2xxx/qla_os.c
> > +++ b/drivers/scsi/qla2xxx/qla_os.c
> > @@ -4878,6 +4878,9 @@ qla2x00_post_work(struct scsi_qla_host *vha,
> > struct qla_work_evt *e)
> >  	unsigned long flags;
> >  	bool q = false;
> >  
> > +	if (!vha->hw->flags.fw_started)
> > +		return QLA_FUNCTION_FAILED;
> > +
> 
> I'd probably not do it here; rather in qla2x00_get_sp() 
> /qla2x00_start_sp()/ qla2xxx_get_qpair_sp() time. Not all works are
> for 
> posting to firmware (QLA_EVT_IDC_ACK, QLA_EVT_UNMAP etc.).

qla81xx_idc_ack() calls qla2x00_mailbox_command(), which should be
avoided IMO. But I'll review the various cases and re-post the patch.

Thanks,
Martin



> 
> Regards,
> -Arun
> 
> >  	spin_lock_irqsave(&vha->work_lock, flags);
> >  	list_add_tail(&e->list, &vha->work_list);
> >  
> >
Martin Wilck March 25, 2020, 5:20 p.m. UTC | #6
On Wed, 2020-03-25 at 17:28 +0100, Martin Wilck wrote:
> On Tue, 2020-03-24 at 16:51 -0700, Arun Easi wrote:
> > On Wed, 5 Feb 2020, 1:44pm, mwilck@suse.com wrote:
> > 
> > > From: Martin Wilck <mwilck@suse.com>
> > > 
> > > Since commit 45235022da99 ("scsi: qla2xxx: Fix driver unload by
> > > shutting down chip"),
> > > it is possible that FC commands are scheduled after the adapter
> > > firmware
> > > has been shut down. IO sent to the firmware in this situation
> > > hangs
> > > indefinitely. Avoid this for the LOGO code path that is typically
> > > taken
> > > when adapters are shut down.
> > > 
> > > Fixes: 45235022da99 ("scsi: qla2xxx: Fix driver unload by
> > > shutting
> > > down chip")
> > > Signed-off-by: Martin Wilck <mwilck@suse.com>
> > > Reviewed-by: Roman Bolshakov <r.bolshakov@yadro.com>
> > > ---
> > >  drivers/scsi/qla2xxx/qla_mbx.c | 3 +++
> > >  drivers/scsi/qla2xxx/qla_os.c  | 3 +++
> > >  2 files changed, 6 insertions(+)
> > > 
> > > [...]
> > > --- a/drivers/scsi/qla2xxx/qla_os.c
> > > +++ b/drivers/scsi/qla2xxx/qla_os.c
> > > @@ -4878,6 +4878,9 @@ qla2x00_post_work(struct scsi_qla_host
> > > *vha,
> > > struct qla_work_evt *e)
> > >  	unsigned long flags;
> > >  	bool q = false;
> > >  
> > > +	if (!vha->hw->flags.fw_started)
> > > +		return QLA_FUNCTION_FAILED;
> > > +
> > 
> > I'd probably not do it here; rather in qla2x00_get_sp() 
> > /qla2x00_start_sp()/ qla2xxx_get_qpair_sp() time. Not all works are
> > for 
> > posting to firmware (QLA_EVT_IDC_ACK, QLA_EVT_UNMAP etc.).
> 
> qla81xx_idc_ack() calls qla2x00_mailbox_command(), which should be
> avoided IMO. But I'll review the various cases and re-post the patch.

Thinking about it once more, the approach is racy. 

qla2x00_try_stop_firmware() can be called any time, and it sets
fw_started = 0 *after* actually stopping the firmware. Even if we check
fw_started, the firwmare might be stopped between our check and our
actual mailbox / FW register access, and we'd end up hanging.

At least the check should be as close as possible to the actual FW
access, e.g. in qla2x00_mailbox_command(), or before writing to the
request queue registers in qla2x00_start_sp() etc.

Perhaps the (!fw_started) condition should be treated like
ABORT_ISP_ACTIVE in qla2x00_mailbox_command, i.e. execute only if
is_rom_cmd() returns true?

I can re-post, but I feel this should really be done by someone who
knows exactly how the firmware operates, IOW Marvell staff.

Regards
Martin
Martin Wilck March 25, 2020, 6:21 p.m. UTC | #7
On Wed, 2020-03-25 at 18:20 +0100, Martin Wilck wrote:
> Perhaps the (!fw_started) condition should be treated like
> ABORT_ISP_ACTIVE in qla2x00_mailbox_command, i.e. execute only if
> is_rom_cmd() returns true?

No, this won't be sufficient, as e.g. MBC_IOCB_COMMAND_A64 is in
rom_cmds[], but has been found to hang (this is why I had the hunk in
qla24xx_fabric_logout()). The list of mailbox commands
that must be passed on when the FW is stopped has to be shorter than
rom_cmds[].

Better suggestions welcome.

Martin
Arun Easi March 25, 2020, 10:13 p.m. UTC | #8
On Wed, 25 Mar 2020, 11:21am, Martin Wilck wrote:

> On Wed, 2020-03-25 at 18:20 +0100, Martin Wilck wrote:
> > Perhaps the (!fw_started) condition should be treated like
> > ABORT_ISP_ACTIVE in qla2x00_mailbox_command, i.e. execute only if
> > is_rom_cmd() returns true?
> 
> No, this won't be sufficient, as e.g. MBC_IOCB_COMMAND_A64 is in
> rom_cmds[], but has been found to hang (this is why I had the hunk in
> qla24xx_fabric_logout()). The list of mailbox commands
> that must be passed on when the FW is stopped has to be shorter than
> rom_cmds[].
> 

With your patch 2/3, where UNLOADING is set prior to the reset of chip, in 
place this issue should be largely addressed. Basically, paths that send 
out request to wire check UNLOADING bit (if any path is missing that, 
should add the check) before sending it out.

Now with UNLOADING set (with your patch 2/3), chip reset, and all 
outstanding command's completion called (qla2x00_abort_isp_cleanup) I see 
less chance of anything being sent out. If you see any issue with your 
patches 1 & 2 (addressing my comments) applied, let me know and we can 
tackle then. How about that?

Regards,
-Arun
Martin Wilck March 25, 2020, 10:41 p.m. UTC | #9
On Wed, 2020-03-25 at 15:13 -0700, Arun Easi wrote:
> On Wed, 25 Mar 2020, 11:21am, Martin Wilck wrote:
> 
> > On Wed, 2020-03-25 at 18:20 +0100, Martin Wilck wrote:
> > > Perhaps the (!fw_started) condition should be treated like
> > > ABORT_ISP_ACTIVE in qla2x00_mailbox_command, i.e. execute only if
> > > is_rom_cmd() returns true?
> > 
> > No, this won't be sufficient, as e.g. MBC_IOCB_COMMAND_A64 is in
> > rom_cmds[], but has been found to hang (this is why I had the hunk
> > in
> > qla24xx_fabric_logout()). The list of mailbox commands
> > that must be passed on when the FW is stopped has to be shorter
> > than
> > rom_cmds[].
> > 
> 
> With your patch 2/3, where UNLOADING is set prior to the reset of
> chip, in 
> place this issue should be largely addressed. Basically, paths that
> send 
> out request to wire check UNLOADING bit (if any path is missing
> that, 
> should add the check) before sending it out.
> 
> Now with UNLOADING set (with your patch 2/3), chip reset, and all 
> outstanding command's completion called (qla2x00_abort_isp_cleanup) I
> see 
> less chance of anything being sent out. If you see any issue with
> your 
> patches 1 & 2 (addressing my comments) applied, let me know and we
> can 
> tackle then. How about that?

It sounds like a plan. Although it means that I just wasted time trying
to figure out which mailbox commands need to be processed even if the
firmware is down :-)

Thanks,
Martin
Arun Easi March 25, 2020, 11:35 p.m. UTC | #10
On Wed, 25 Mar 2020, 3:41pm, Martin Wilck wrote:

>
> ----------------------------------------------------------------------
> On Wed, 2020-03-25 at 15:13 -0700, Arun Easi wrote:
> > On Wed, 25 Mar 2020, 11:21am, Martin Wilck wrote:
> > 
> > > On Wed, 2020-03-25 at 18:20 +0100, Martin Wilck wrote:
> > > > Perhaps the (!fw_started) condition should be treated like
> > > > ABORT_ISP_ACTIVE in qla2x00_mailbox_command, i.e. execute only if
> > > > is_rom_cmd() returns true?
> > > 
> > > No, this won't be sufficient, as e.g. MBC_IOCB_COMMAND_A64 is in
> > > rom_cmds[], but has been found to hang (this is why I had the hunk
> > > in
> > > qla24xx_fabric_logout()). The list of mailbox commands
> > > that must be passed on when the FW is stopped has to be shorter
> > > than
> > > rom_cmds[].
> > > 
> > 
> > With your patch 2/3, where UNLOADING is set prior to the reset of 
> > chip, in place this issue should be largely addressed. Basically, 
> > paths that send out request to wire check UNLOADING bit (if any path 
> > is missing that, should add the check) before sending it out.
> > 
> > Now with UNLOADING set (with your patch 2/3), chip reset, and all 
> > outstanding command's completion called (qla2x00_abort_isp_cleanup) I 
> > see less chance of anything being sent out. If you see any issue with 
> > your patches 1 & 2 (addressing my comments) applied, let me know and 
> > we can tackle then. How about that?
> 
> It sounds like a plan. Although it means that I just wasted time trying
> to figure out which mailbox commands need to be processed even if the
> firmware is down :-)
> 

I hope it was not too much time. On the plus side, you know the mailbox 
path very well now. :)

Regards,
-Arun
diff mbox series

Patch

diff --git a/drivers/scsi/qla2xxx/qla_mbx.c b/drivers/scsi/qla2xxx/qla_mbx.c
index 9e09964..53129f2 100644
--- a/drivers/scsi/qla2xxx/qla_mbx.c
+++ b/drivers/scsi/qla2xxx/qla_mbx.c
@@ -2644,6 +2644,9 @@  qla24xx_fabric_logout(scsi_qla_host_t *vha, uint16_t loop_id, uint8_t domain,
 	ql_dbg(ql_dbg_mbx + ql_dbg_verbose, vha, 0x106d,
 	    "Entered %s.\n", __func__);
 
+	if (!ha->flags.fw_started)
+		return QLA_FUNCTION_FAILED;
+
 	lg = dma_pool_zalloc(ha->s_dma_pool, GFP_KERNEL, &lg_dma);
 	if (lg == NULL) {
 		ql_log(ql_log_warn, vha, 0x106e,
diff --git a/drivers/scsi/qla2xxx/qla_os.c b/drivers/scsi/qla2xxx/qla_os.c
index b520a98..2dafb46 100644
--- a/drivers/scsi/qla2xxx/qla_os.c
+++ b/drivers/scsi/qla2xxx/qla_os.c
@@ -4878,6 +4878,9 @@  qla2x00_post_work(struct scsi_qla_host *vha, struct qla_work_evt *e)
 	unsigned long flags;
 	bool q = false;
 
+	if (!vha->hw->flags.fw_started)
+		return QLA_FUNCTION_FAILED;
+
 	spin_lock_irqsave(&vha->work_lock, flags);
 	list_add_tail(&e->list, &vha->work_list);