From patchwork Tue Jul 11 03:52:17 2017 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Nicholas A. Bellinger" X-Patchwork-Id: 9834029 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork.web.codeaurora.org (Postfix) with ESMTP id D941160363 for ; Tue, 11 Jul 2017 03:52:25 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id C2E382853B for ; Tue, 11 Jul 2017 03:52:25 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id B4943285D4; Tue, 11 Jul 2017 03:52:25 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-6.9 required=2.0 tests=BAYES_00,RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 10F302853B for ; Tue, 11 Jul 2017 03:52:25 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754832AbdGKDwY (ORCPT ); Mon, 10 Jul 2017 23:52:24 -0400 Received: from mail.linux-iscsi.org ([67.23.28.174]:59099 "EHLO linux-iscsi.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754680AbdGKDwY (ORCPT ); Mon, 10 Jul 2017 23:52:24 -0400 Received: from [192.168.1.66] (75-37-194-224.lightspeed.lsatca.sbcglobal.net [75.37.194.224]) (using SSLv3 with cipher AES256-SHA (256/256 bits)) (No client certificate requested) (Authenticated sender: nab) by linux-iscsi.org (Postfix) with ESMTPSA id ACFCA40B19; Tue, 11 Jul 2017 03:57:10 +0000 (UTC) Message-ID: <1499745137.28145.15.camel@haakon3.daterainc.com> Subject: Re: kernel BUG at drivers/scsi/qla2xxx/qla_target.c:3319! From: "Nicholas A. Bellinger" To: Pascal de Bruijn Cc: "Madhani, Himanshu" , target-devel , "Tran, Quinn" , s.kleijkers@unilogic.nl Date: Mon, 10 Jul 2017 20:52:17 -0700 In-Reply-To: <1499744492.28145.12.camel@haakon3.daterainc.com> References: <1498429881.26123.55.camel@haakon3.risingtidesystems.com> <0883cc0b-1cd8-66e7-e2d0-8bd9c0c76b9d@unilogic.nl> <1498807372.24896.45.camel@haakon3.risingtidesystems.com> <061954d8-6cfe-8b21-6e35-3f3c570b8c68@unilogic.nl> <1499401581.30628.16.camel@haakon3.risingtidesystems.com> <1499744492.28145.12.camel@haakon3.daterainc.com> X-Mailer: Evolution 3.4.4-1 Mime-Version: 1.0 Sender: target-devel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: target-devel@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP On Mon, 2017-07-10 at 20:41 -0700, Nicholas A. Bellinger wrote: > On Mon, 2017-07-10 at 17:04 +0200, Pascal de Bruijn wrote: > > On 07-07-17 06:26, Nicholas A. Bellinger wrote: > > > On Mon, 2017-07-03 at 16:03 +0200, Pascal de Bruijn wrote: > > >> So abort_task can still be observed, but they do not result in a > > >> non-functional not-quite-PANICked machine anymore. > > > Thank alot for the bug report and your continuous testing to get this > > > resolved. The patch is queued up in target-pending/for-next with your > > > Tested-by, and will be CC' to stable so the older v4.x.y and v3.x.y > > > kernels get this fix as well. > > > > > > Thanks again. > > > > I'm afraid we may not be quite there yet after all... > > > > So we had the other two machines run an md check this weekend > > as well, again with a rediculously high synx_speed_max: > > > > Jul 9 04:00:01 myhost kernel: [661309.794774] md: data-check of RAID array md0 > > Jul 9 04:00:01 myhost kernel: [661309.799173] md: minimum _guaranteed_ speed: 10000 KB/sec/disk. > > Jul 9 04:00:01 myhost kernel: [661309.805219] md: using maximum available idle IO bandwidth (but not more than 1000000 KB/sec) for data-check. > > Jul 9 04:00:01 myhost kernel: [661309.815194] md: using 128k window, over a total of 3252682752k. > > Jul 9 04:00:42 myhost kernel: [661351.076391] qla2xxx/21:00:00:24:ff:4b:8f:58: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION. > > Jul 9 04:02:01 myhost kernel: [661429.985082] qla2xxx/21:00:00:24:ff:4b:9e:19: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION. > > Jul 9 04:04:24 myhost kernel: [661573.395245] qla2xxx/50:01:43:80:28:ca:86:36: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION. > > Jul 9 04:04:57 myhost kernel: [661605.837694] qla2xxx/50:01:43:80:28:ca:86:e6: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION. > > Jul 9 04:09:19 myhost kernel: [661868.261211] qla2xxx/21:00:00:24:ff:54:9e:ab: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION. > > Jul 9 04:13:17 myhost kernel: [662105.788459] ABORT_TASK: Found referenced qla2xxx task_tag: 1175332 > > Jul 9 04:13:17 myhost kernel: [662105.794794] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1175332 > > Jul 9 04:13:17 myhost kernel: [662105.990584] ABORT_TASK: Found referenced qla2xxx task_tag: 1175380 > > Jul 9 04:13:18 myhost kernel: [662106.510403] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 1175380 > > Jul 9 04:13:20 myhost kernel: [662108.988526] ABORT_TASK: Found referenced qla2xxx task_tag: 1175620 > > Jul 9 04:13:31 myhost kernel: [662119.684969] ABORT_TASK: Found referenced qla2xxx task_tag: 1211140 > > This means the outstanding I/O was located, but never completed back. > > Or, there is something else blocked waiting for completion like > outstanding qla2xxx WRITE transfer, before the abort may proceed. > > > Jul 9 04:16:42 myhost kernel: [662310.617910] qla2xxx/21:00:00:24:ff:92:bf:43: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION. > > Jul 9 04:18:00 myhost kernel: [662389.415853] qla2xxx/21:00:00:24:ff:54:a1:33: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION. > > Jul 9 04:18:22 myhost kernel: [662411.066461] qla2xxx/21:00:00:24:ff:92:bf:59: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION. > > Jul 9 04:20:23 myhost kernel: [662531.852833] qla2xxx/21:00:00:24:ff:3c:d0:94: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION. > > Jul 9 07:00:28 myhost kernel: [672137.325166] md: md0: data-check done. > > > > Btw, where there any hung task warnings while this was happening..? > > > The machine in question was still responsive (was accepting > > SSH logins), however it seemed VMware hosts weren't seeing > > the volume anymore (presumably due to the heavy IO on the backend). > > > > Also, several (12?) kworkers seemed stuck in a D state. > > > > Next time that happens, do a 'cat /proc/$PID/stack' to see where those > kworkers are blocked in un-interruptible sleep. > > > When my collegue tried to reboot the machine it got > > (presumably) stuck on > > > > /usr/bin/targetctl clear > > > > After which it was forcefully rebooted :) > > > > Sorry we don't have any more detailed info at this point. > > We haven't been able to reproduce this on a > > different machine yet :( > > Ok, please try with the following debug patch to verify if it's blocked > on qla_tgt_cmd->write_pending_abort_comp. > > diff --git a/drivers/scsi/qla2xxx/tcm_qla2xxx.c b/drivers/scsi/qla2xxx/tcm_qla2xxx.c > index 8c1bf9b..1199969 100644 > --- a/drivers/scsi/qla2xxx/tcm_qla2xxx.c > +++ b/drivers/scsi/qla2xxx/tcm_qla2xxx.c > @@ -425,6 +425,7 @@ static int tcm_qla2xxx_write_pending_status(struct se_cmd *se_cmd) > if (se_cmd->t_state == TRANSPORT_WRITE_PENDING || > se_cmd->t_state == TRANSPORT_COMPLETE_QF_WP) { > spin_unlock_irqrestore(&se_cmd->t_state_lock, flags); > + printk("Completing write_pending_abort_comp for tag: %llu\n", se_cmd->tag); > wait_for_completion(&cmd->write_pending_abort_comp); > return 0; > } > @@ -515,7 +516,11 @@ static void tcm_qla2xxx_handle_data_work(struct work_struct *work) > * waiting upon completion in tcm_qla2xxx_write_pending_status() > */ > if (cmd->se_cmd.transport_state & CMD_T_ABORTED) { > + u64 tag = cmd->se_cmd.tag; > + > + printk("Blocked on write_pending_abort_comp for tag: %llu\n", tag); > complete(&cmd->write_pending_abort_comp); > + printk("Awoke write_pending_abort_comp for tag: %llu\n", tag); > return; > } > Whoops, mixed up the printk() output above.. Please use this one instead: --- To unsubscribe from this list: send the line "unsubscribe target-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html diff --git a/drivers/scsi/qla2xxx/tcm_qla2xxx.c b/drivers/scsi/qla2xxx/tcm_qla2xxx.c index 8c1bf9b..01ffb8a 100644 --- a/drivers/scsi/qla2xxx/tcm_qla2xxx.c +++ b/drivers/scsi/qla2xxx/tcm_qla2xxx.c @@ -417,6 +417,8 @@ static int tcm_qla2xxx_write_pending_status(struct se_cmd *se_cmd) { struct qla_tgt_cmd *cmd = container_of(se_cmd, struct qla_tgt_cmd, se_cmd); unsigned long flags; + u64 tag = se_cmd->tag; + /* * Check for WRITE_PENDING status to determine if we need to wait for * CTIO aborts to be posted via hardware in tcm_qla2xxx_handle_data(). @@ -425,7 +427,9 @@ static int tcm_qla2xxx_write_pending_status(struct se_cmd *se_cmd) if (se_cmd->t_state == TRANSPORT_WRITE_PENDING || se_cmd->t_state == TRANSPORT_COMPLETE_QF_WP) { spin_unlock_irqrestore(&se_cmd->t_state_lock, flags); + printk("Blocked on write_pending_abort_comp for tag: %llu\n", tag); wait_for_completion(&cmd->write_pending_abort_comp); + printk("Awoke write_pending_abort_comp for tag: %llu\n", tag); return 0; } spin_unlock_irqrestore(&se_cmd->t_state_lock, flags); @@ -508,6 +512,11 @@ static void tcm_qla2xxx_handle_data_work(struct work_struct *work) */ cmd->cmd_in_wq = 0; + if (cmd->se_cmd.transport_state & CMD_T_ABORTED) { + printk("Detected CMD_T_ABORTED from handle_data_work tag: %llu %u\n", + cmd->se_cmd.tag, cmd->write_data_transferred); + } + cmd->vha->tgt_counters.qla_core_ret_ctio++; if (!cmd->write_data_transferred) { /* @@ -515,6 +524,8 @@ static void tcm_qla2xxx_handle_data_work(struct work_struct *work) * waiting upon completion in tcm_qla2xxx_write_pending_status() */ if (cmd->se_cmd.transport_state & CMD_T_ABORTED) { + printk("Completing write_pending_abort_comp for tag: %llu\n", + cmd->se_cmd.tag); complete(&cmd->write_pending_abort_comp); return; }