From patchwork Mon Jul 31 00:00:55 2017 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: "Nicholas A. Bellinger" X-Patchwork-Id: 9870745 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork.web.codeaurora.org (Postfix) with ESMTP id 59BB260353 for ; Mon, 31 Jul 2017 00:01:26 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 413D02856A for ; Mon, 31 Jul 2017 00:01:26 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 360A228587; Mon, 31 Jul 2017 00:01:26 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-6.9 required=2.0 tests=BAYES_00,RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 9914E285BF for ; Mon, 31 Jul 2017 00:01:24 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751107AbdGaAA5 (ORCPT ); Sun, 30 Jul 2017 20:00:57 -0400 Received: from mail.linux-iscsi.org ([67.23.28.174]:43382 "EHLO linux-iscsi.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751099AbdGaAA4 (ORCPT ); Sun, 30 Jul 2017 20:00:56 -0400 Received: from [172.16.2.183] (unknown [50.225.59.10]) (using SSLv3 with cipher AES256-SHA (256/256 bits)) (No client certificate requested) (Authenticated sender: nab) by linux-iscsi.org (Postfix) with ESMTPSA id 4DF7340B19; Mon, 31 Jul 2017 00:06:39 +0000 (UTC) Message-ID: <1501459255.1499.71.camel@haakon3.daterainc.com> Subject: Re: kernel BUG at drivers/scsi/qla2xxx/qla_target.c:3643! (or :3319) From: "Nicholas A. Bellinger" To: =?UTF-8?Q?=C5=81ukasz?= Engel Cc: target-devel@vger.kernel.org Date: Sun, 30 Jul 2017 17:00:55 -0700 In-Reply-To: <0de08a82-cc9f-e073-a3b5-e7c10fc3e22c@softax.pl> References: <0de08a82-cc9f-e073-a3b5-e7c10fc3e22c@softax.pl> X-Mailer: Evolution 3.4.4-1 Mime-Version: 1.0 Sender: target-devel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: target-devel@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP Hi Lukasz, On Thu, 2017-07-27 at 17:23 +0200, Ɓukasz Engel wrote: > Hello > > I am experiencing problems using LIO FC target with vmware ESX (6.5). > > Thisis the same problem as discussed in > https://www.spinics.net/lists/target-devel/msg15670.html > I've tried 4.11.x (up to 12) and 4.12.3(and also some kernels from 4.9 > line). > > My setup is 3 ESX 6.5 hosts using FC target storage with 2x QLogic 2460 > (4gbit) (I also have another storage with QLE-2562 (8gbit)- same problem > occurs). > > I am able to reproduce the problem with almost 100%efficiency - > generating large i/o load by concurrently resuming many VMs - problem > always occurs within minute. > > * on clean4.12.3: > > BUG occurs on almost always first ABORT_TASK issued by vmware: > > [...] > Jul 27 17:06:40 teststorage [ 31.336673] qla2xxx [0000:06:00.0]-ffff:12: qla24xx_do_nack_work create sess success ffff881f80d6dc00 > Jul 27 17:06:40 teststorage [ 708.583638] Detected MISCOMPARE for addr: ffff881f8fb15000 buf: ffff881f7a46aa00 > Jul 27 17:06:40 teststorage [ 708.583651] Target/iblock: Send MISCOMPARE check condition and sense > Jul 27 17:06:42 teststorage [ 710.627503] ABORT_TASK: Found referenced qla2xxx task_tag: 1168472 > Jul 27 17:06:43 teststorage [ 711.454055] ------------[ cut here ]------------ > Jul 27 17:06:43 teststorage [ 711.454062] kernel BUG at drivers/scsi/qla2xxx/qla_target.c:3643! > Jul 27 17:06:43 teststorage [ 711.454067] invalid opcode: 0000 [#1] SMP > Jul 27 17:06:43 teststorage [ 711.454070] Modules linked in: > tcm_qla2xxx target_core_user uio target_core_pscsi target_core_file > target_core_iblock iscsi_target_mod target_core_mod netconsole lm90 vfat > fat intel_rapl x86_pkg_temp_thermal coretemp crct10dif_pclmul ses > crc32_pclmul enclosure ghash_clmulni_intel scsi_transport_sas iTCO_wdt > iTCO_vendor_support lpc_ich i2c_i801 mei_me mei tpm_tis tpm_tis_core tpm > nfsd auth_rpcgss nfs_acl lockd grace dm_multipath sunrpc binfmt_misc > nouveau video drm_kms_helper ttm raid1 drm qla2xxx igb e1000e mxm_wmi > aacraid dca ptp crc32c_intel pps_core scsi_transport_fc i2c_algo_bit wmi > Jul 27 17:06:43 teststorage [ 711.454101] CPU: 3 PID: 209 Comm: kworker/u24:7 Not tainted 4.12.3 #1 > Jul 27 17:06:43 teststorage [ 711.454104] Hardware name: ASUS All Series/X99-E WS/USB 3.1, BIOS 3402 11/14/2016 > Jul 27 17:06:43 teststorage [ 711.454118] Workqueue: tmr-iblock target_tmr_work [target_core_mod] > Jul 27 17:06:43 teststorage [ 711.454122] task: ffff881f82173b00 task.stack: ffffc9000d770000 > Jul 27 17:06:43 teststorage [ 711.454132] RIP: 0010:qlt_free_cmd+0x138/0x150 [qla2xxx] > Jul 27 17:06:43 teststorage [ 711.454135] RSP: 0018:ffffc9000d773d60 EFLAGS: 00010286 > Jul 27 17:06:43 teststorage [ 711.454138] RAX: 0000000000000088 RBX: ffff881f8ebb8bf8 RCX: ffffffffa019fade > Jul 27 17:06:43 teststorage [ 711.454142] RDX: 000000000000e074 RSI: ffff881f76fc87c0 RDI: 0000000000004000 > Jul 27 17:06:43 teststorage [ 711.454145] RBP: ffffc9000d773d80 R08: ffffffffa018c07c R09: ffff881f8ebb8bf8 > Jul 27 17:06:43 teststorage [ 711.454148] R10: ffffea007e076ac0 R11: ffff881f81dab640 R12: ffff881f80d6d800 > Jul 27 17:06:43 teststorage [ 711.454151] R13: ffff881f8ebb8c80 R14: 0000000000000286 R15: ffff881f8f5e2000 > Jul 27 17:06:43 teststorage [ 711.454155] FS: 0000000000000000(0000) GS:ffff881f9f2c0000(0000) knlGS:0000000000000000 > Jul 27 17:06:43 teststorage [ 711.454158] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > Jul 27 17:06:43 teststorage [ 711.454161] CR2: 00007fff00782a68 CR3: 0000001f81e33000 CR4: 00000000003406e0 > Jul 27 17:06:43 teststorage [ 711.454164] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > Jul 27 17:06:43 teststorage [ 711.454167] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 > Jul 27 17:06:43 teststorage [ 711.454170] Call Trace: > Jul 27 17:06:43 teststorage [ 711.454176] tcm_qla2xxx_release_cmd+0x14/0x30 [tcm_qla2xxx] > Jul 27 17:06:43 teststorage [ 711.454183] target_put_sess_cmd+0xce/0x140 [target_core_mod] > Jul 27 17:06:43 teststorage [ 711.454190] core_tmr_abort_task+0x127/0x190 [target_core_mod] > Jul 27 17:06:43 teststorage [ 711.454197] target_tmr_work+0x111/0x120 [target_core_mod] > Jul 27 17:06:43 teststorage [ 711.454203] process_one_work+0x144/0x370 > Jul 27 17:06:43 teststorage [ 711.454206] worker_thread+0x4d/0x3c0 > Jul 27 17:06:43 teststorage [ 711.454210] kthread+0x109/0x140 > Jul 27 17:06:43 teststorage [ 711.454487] ? rescuer_thread+0x360/0x360 > Jul 27 17:06:43 teststorage [ 711.455029] ? kthread_park+0x60/0x60 > Jul 27 17:06:43 teststorage [ 711.455579] ret_from_fork+0x25/0x30 > Jul 27 17:06:43 teststorage [ 711.456122] Code: 7f a6 73 e1 0f b6 83 8c > 02 00 00 e9 54 ff ff ff 48 8b bb c0 02 00 00 48 89 de e8 04 fd ff ff 0f > b6 83 8c 02 00 00 e9 35 ff ff ff <0f> 0b 48 8b bb 90 02 00 00 e8 8a e2 > 0a e1 e9 44 ff ff ff 0f 0b > Jul 27 17:06:43 teststorage [ 711.457279] RIP: qlt_free_cmd+0x138/0x150 [qla2xxx] RSP: ffffc9000d773d60 > Jul 27 17:06:43 teststorage [ 711.457857] ---[ end trace 99e519b9a5b6591b ]--- > Yes, this is most certainly the same bug the patch here addresses: https://www.spinics.net/lists/target-devel/msg15759.html Also, a smaller version of that just went out with your CC' on it here: qla2xxx: Fix incorrect tcm_qla2xxx_free_cmd use during TMR ABORT (v2) https://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending.git/commit/?id=6bcbb3174caa5f1ccc894f8ae077631659d5a629 > 8<----------- > > I have also tried patch from > https://www.spinics.net/lists/target-devel/msg15759.html > (4db6a8145940d0bbd10265020d681961ce2d3238 - currently not available in > git(?)) > > The results is - no BUGs, system survives a little longer, but it's a > matter of minutes when kworker hangs: > > Jul 26 15:21:33 teststorage [ 3588.629920] ABORT_TASK: Found referenced qla2xxx task_tag: 1202220 > Jul 26 15:21:34 teststorage [ 3589.491615] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 1202220 > Jul 26 15:21:34 teststorage [ 3589.491700] ABORT_TASK: Found referenced qla2xxx task_tag: 1202264 > Jul 26 15:25:16 teststorage [ 3810.752583] INFO: task kworker/u24:2:2975 blocked for more than 120 seconds. > Jul 26 15:25:16 teststorage [ 3810.752672] Not tainted 4.12.3 #1 > Jul 26 15:25:16 teststorage [ 3810.752749] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > Jul 26 15:25:16 teststorage [ 3810.752831] kworker/u24:2 D 0 2975 2 0x00000080 > Jul 26 15:25:16 teststorage [ 3810.752920] Workqueue: tmr-iblock target_tmr_work [target_core_mod] > Jul 26 15:25:16 teststorage [ 3810.753006] Call Trace: > Jul 26 15:25:16 teststorage [ 3810.753207] __schedule+0x285/0x840 > Jul 26 15:25:16 teststorage [ 3810.753410] schedule+0x36/0x80 > Jul 26 15:25:16 teststorage [ 3810.753623] schedule_timeout+0x242/0x2f0 > Jul 26 15:25:16 teststorage [ 3810.753832] ? radix_tree_lookup+0xd/0x10 > Jul 26 15:25:16 teststorage [ 3810.754046] ? get_work_pool+0x2d/0x40 > Jul 26 15:25:16 teststorage [ 3810.754263] ? flush_work+0x14d/0x190 > Jul 26 15:25:16 teststorage [ 3810.754486] wait_for_completion+0x111/0x170 > Jul 26 15:25:16 teststorage [ 3810.754717] ? wake_up_q+0x80/0x80 > Jul 26 15:25:16 teststorage [ 3810.754949] __transport_wait_for_tasks+0xa7/0x140 [target_core_mod] > Jul 26 15:25:16 teststorage [ 3810.755189] transport_wait_for_tasks+0x53/0x90 [target_core_mod] > Jul 26 15:25:16 teststorage [ 3810.755454] core_tmr_abort_task+0x10e/0x190 [target_core_mod] > Jul 26 15:25:16 teststorage [ 3810.755717] target_tmr_work+0x111/0x120 [target_core_mod] > Jul 26 15:25:16 teststorage [ 3810.755972] process_one_work+0x144/0x370 > Jul 26 15:25:16 teststorage [ 3810.756236] worker_thread+0x4d/0x3c0 > Jul 26 15:25:16 teststorage [ 3810.756507] kthread+0x109/0x140 > Jul 26 15:25:16 teststorage [ 3810.756785] ? rescuer_thread+0x360/0x360 > Jul 26 15:25:16 teststorage [ 3810.757060] ? kthread_park+0x60/0x60 > Jul 26 15:25:16 teststorage [ 3810.757348] ? do_syscall_64+0x67/0x150 > Jul 26 15:25:16 teststorage [ 3810.757636] ret_from_fork+0x25/0x30 > Jul 26 15:25:16 teststorage [ 3810.757927] NMI backtrace for cpu 3 > Jul 26 15:25:16 teststorage [ 3810.758220] CPU: 3 PID: 107 Comm: khungtaskd Not tainted 4.12.3 #1 > Jul 26 15:25:16 teststorage [ 3810.758530] Hardware name: ASUS All Series/X99-E WS/USB 3.1, BIOS 3402 11/14/2016 > This looks almost identical to what Pascal reported. Namely that target-core is waiting for a particular tag (1202264) during ABORT_TASK, but the backend never completes it back to target-core. As with Pascal, please apply the qla2xxx patch above along with the following debug patch, so we can verify if this is the case: > > (I have also tried 4.13-rc2, but currently I have some other problem > with initiating qla2xxx there ("qla2xxx 0000:09:00.0: can't allocate > MSI-X affinity masks for 2 vectors"), so I can't use FC target...) > > I am willing to provide more information if needed - It's currently my > out-of-production storage and so if anyone can look at this problem I > can on my side experiment and play with patches, retest, etc. > > Any help appreciated :) > Also, some more details on what your backend device storage behind target-core looks like would be helpful. Physical HBAs, device information, and any MD/RAID/LVM/etc on top.. --- To unsubscribe from this list: send the line "unsubscribe target-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html diff --git a/drivers/target/target_core_transport.c b/drivers/target/target_core_transport.c index 0197635..0bd8c64 100644 --- a/drivers/target/target_core_transport.c +++ b/drivers/target/target_core_transport.c @@ -2827,7 +2827,13 @@ void transport_clear_lun_ref(struct se_lun *lun) spin_unlock_irqrestore(&cmd->t_state_lock, *flags); - wait_for_completion(&cmd->t_transport_stop_comp); + while (!wait_for_completion_timeout(&cmd->t_transport_stop_comp, + 180 * HZ)) { + pr_warn_ratelimited("se_cmd waiting CDB: 0x%02x tag: %llu t_state: %d" + "se_cmd_flags: 0x%08x transport_state: 0x%08x\n", + cmd->t_task_cdb[0], cmd->tag, cmd->t_state, + cmd->se_cmd_flags, cmd->transport_state); + } spin_lock_irqsave(&cmd->t_state_lock, *flags); cmd->transport_state &= ~(CMD_T_ACTIVE | CMD_T_STOP);