From patchwork Wed Oct 24 15:49:32 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Bill Kuzeja X-Patchwork-Id: 10654777 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 9DDBD14BD for ; Wed, 24 Oct 2018 15:56:08 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 91C512ACEB for ; Wed, 24 Oct 2018 15:56:08 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 849F52ADF2; Wed, 24 Oct 2018 15:56:08 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id D764029B94 for ; Wed, 24 Oct 2018 15:56:05 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726747AbeJYAYm convert rfc822-to-8bit (ORCPT ); Wed, 24 Oct 2018 20:24:42 -0400 Received: from us-smtp-delivery-131.mimecast.com ([216.205.24.131]:24542 "EHLO us-smtp-delivery-131.mimecast.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726577AbeJYAYm (ORCPT ); Wed, 24 Oct 2018 20:24:42 -0400 X-Greylist: delayed 389 seconds by postgrey-1.27 at vger.kernel.org; Wed, 24 Oct 2018 20:24:41 EDT Received: from mailhub4.stratus.com (mailhub4.stratus.com [134.111.1.17]) by us-smtp-1.mimecast.com with ESMTP id us-mta-73-7ZK9D3pEOYm_t4WshjXxOw-1; Wed, 24 Oct 2018 11:49:34 -0400 Received: from EXHQ1.corp.stratus.com (exhq1.corp.stratus.com [134.111.200.125]) by mailhub4.stratus.com (8.12.11/8.12.11) with ESMTP id w9OFnXAV010622; Wed, 24 Oct 2018 11:49:34 -0400 Received: from linuxdev.lnx.eng.stratus.com (134.111.220.63) by EXHQ1.corp.stratus.com (134.111.200.125) with Microsoft SMTP Server (TLS) id 14.3.279.2; Wed, 24 Oct 2018 11:49:08 -0400 From: Bill Kuzeja To: CC: , Subject: [PATCH V2] Timeouts occur on surprise removal of QLogic adapter Date: Wed, 24 Oct 2018 11:49:32 -0400 Message-ID: <1540396172-17436-1-git-send-email-William.Kuzeja@stratus.com> X-Mailer: git-send-email 1.8.3.1 MIME-Version: 1.0 X-MC-Unique: 7ZK9D3pEOYm_t4WshjXxOw-1 Sender: linux-scsi-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-scsi@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP When doing a surprise removal of an adapter, some in flight I/Os can get stuck and take a while to complete (they actually timeout and are retried). We are not handling an early error exit from qla2xxx_eh_abort properly. Fixes: 45235022da99 ("scsi: qla2xxx: Fix driver unload by shutting down chip") Signed-off-by: Bill Kuzeja --- Note: (Reworked ACKed patch to cleanly apply to 4.20/scsi-queue as it stands today - please ignore patch I sent yesterday). Patch explanation: After a hot remove of a Qlogic adapter, the driver's remove function gets called and we end up aborting all in progress I/Os. Here is the code flow: qla2x00_remove_one qla2x00_abort_isp_cleanup qla2x00_abort_all_cmds __qla2x00_abort_all_cmds qla2xxx_eh_abort At the start of qla2xxx_eh_abort, some sanity checks are done before actually sending the abort. One of these checks is a call to fc_block_scsi_eh. In the case of a hot remove, it turns out that this routine can exit with FAST_IO_FAIL. When this occurs, we return back to __qla2x00_abort_all_cmds with an extra reference on sp (because the abort never gets sent). Originally, this was addressed with another fix: commit 4cd3b6ebff85 scsi: qla2xxx: Fix extraneous ref on sp's after adapter break But this later this added change complicated matters: commit 45235022da99 scsi: qla2xxx: Fix driver unload by shutting down chip Because the abort is now being done earlier in the teardown (through qla2x00_abort_isp_cleanup), in qla2xxx_eh_abort we make it past the first check because qla2x00_isp_reg_stat(ha) returns zero. When we fail a few lines later in fc_block_scsi_eh, this error is not handled properly in __qla2x00_abort_all_cmds and the I/O ends up hanging and timing out because of the extra reference. For this fix, a check for FAST_IO_FAIL is added to __qla2x00_abort_all_cmds where we check to see if qla2xxx_eh_abort succeeded or not. This removes the extra reference in this additional early exit case. In my testing (hw surprise removals and also adapter remove via sysfs), this eliminates the timeouts and delays and the remove proceeds smoothly. --- drivers/scsi/qla2xxx/qla_os.c | 10 ++++++++-- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/drivers/scsi/qla2xxx/qla_os.c b/drivers/scsi/qla2xxx/qla_os.c index 01607d2..8ee05d9 100644 --- a/drivers/scsi/qla2xxx/qla_os.c +++ b/drivers/scsi/qla2xxx/qla_os.c @@ -1749,7 +1749,7 @@ uint32_t qla2x00_isp_reg_stat(struct qla_hw_data *ha) static void __qla2x00_abort_all_cmds(struct qla_qpair *qp, int res) { - int cnt; + int cnt, status; unsigned long flags; srb_t *sp; scsi_qla_host_t *vha = qp->vha; @@ -1799,10 +1799,16 @@ uint32_t qla2x00_isp_reg_stat(struct qla_hw_data *ha) if (!sp_get(sp)) { spin_unlock_irqrestore (qp->qp_lock_ptr, flags); - qla2xxx_eh_abort( + status = qla2xxx_eh_abort( GET_CMD_SP(sp)); spin_lock_irqsave (qp->qp_lock_ptr, flags); + /* + * Get rid of extra reference caused + * by early exit from qla2xxx_eh_abort + */ + if (status == FAST_IO_FAIL) + atomic_dec(&sp->ref_count); } } sp->done(sp, res);