Message ID: 1501459255.1499.71.camel@haakon3.daterainc.com (mailing list archive)
State: New, archived
Hi Nicholas,

On 31.07.2017 at 02:00, Nicholas A. Bellinger wrote:
> Hi Lukasz,
>
> On Thu, 2017-07-27 at 17:23 +0200, Łukasz Engel wrote:
>> Hello
>>
>> I am experiencing problems using LIO FC target with vmware ESX (6.5).
>>
>> This is the same problem as discussed in
>> https://www.spinics.net/lists/target-devel/msg15670.html
>> I've tried 4.11.x (up to 12) and 4.12.3 (and also some kernels from the 4.9
>> line).
[...]
>> Yes, this is most certainly the same bug the patch here addresses:
>>
>> https://www.spinics.net/lists/target-devel/msg15759.html
>>
>> Also, a smaller version of that just went out with your CC' on it here:
>>
>> qla2xxx: Fix incorrect tcm_qla2xxx_free_cmd use during TMR ABORT (v2)
>> https://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending.git/commit/?id=6bcbb3174caa5f1ccc894f8ae077631659d5a629

After applying both patches (for 6bcbb3174caa5f1ccc894f8ae077631659d5a629,
manual adjusting of the second hunk was required for 4.12.x - the context
after the match is slightly different), there is no BUG, but a hang in a
kworker instead, with all I/O from FC frozen:

Jul 31 16:27:14 teststorage [ 42.897291] qla2xxx [0000:06:00.0]-500a:12: LOOP UP detected (4 Gbps).
Jul 31 16:27:16 teststorage [ 42.959068] qla2xxx [0000:09:00.0]-ffff:11: qla24xx_do_nack_work create sess success ffff881f80e29200
Jul 31 16:27:16 teststorage [ 44.144811] qla2xxx [0000:06:00.0]-ffff:12: qla24xx_do_nack_work create sess success ffff881f80e28400
Jul 31 16:27:16 teststorage [ 44.230770] qla2xxx [0000:06:00.0]-ffff:12: qla24xx_do_nack_work create sess success ffff881f8f575a00
Jul 31 16:27:16 teststorage [ 44.473650] qla2xxx [0000:06:00.0]-ffff:12: qla24xx_do_nack_work create sess success ffff881f8f575600
Jul 31 16:27:16 teststorage [ 44.520821] qla2xxx [0000:06:00.0]-ffff:12: qla24xx_do_nack_work create sess success ffff881f8f574400
Jul 31 16:27:16 teststorage [ 44.538183] qla2xxx [0000:06:00.0]-ffff:12: qla24xx_do_nack_work create sess success ffff881f900f3400
Jul 31 16:27:16 teststorage [ 44.984052] qla2xxx [0000:06:00.0]-ffff:12: qla24xx_do_nack_work create sess success ffff881f8f577a00
Jul 31 16:28:54 teststorage [ 45.042585] qla2xxx [0000:06:00.0]-ffff:12: qla24xx_do_nack_work create sess success ffff881f80e28000
Jul 31 16:28:54 teststorage [ 142.431045] ABORT_TASK: Found referenced qla2xxx task_tag: 1189328
Jul 31 16:28:54 teststorage [ 142.431059] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1189328
Jul 31 16:28:57 teststorage [ 145.632571] ABORT_TASK: Found referenced qla2xxx task_tag: 1177932
Jul 31 16:28:57 teststorage [ 145.897299] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 1177932
Jul 31 16:29:00 teststorage [ 148.388569] ABORT_TASK: Found referenced qla2xxx task_tag: 1164160
Jul 31 16:32:00 teststorage [ 328.848344] se_cmd waiting CDB: 0x89 tag: 1164160 t_state: 5 se_cmd_flags: 0x01c80109 transport_state: 0x00000033
Jul 31 16:32:41 teststorage [ 369.808484] INFO: task kworker/1:93:2802 blocked for more than 120 seconds.
Jul 31 16:32:41 teststorage [ 369.808493] Not tainted 4.12.3p1 #3
Jul 31 16:32:41 teststorage [ 369.808496] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 31 16:32:41 teststorage [ 369.808499] kworker/1:93 D 0 2802 2 0x00000080
Jul 31 16:32:41 teststorage [ 369.808515] Workqueue: events qlt_free_session_done [qla2xxx]
Jul 31 16:32:41 teststorage [ 369.808518] Call Trace:
Jul 31 16:32:41 teststorage [ 369.808524] __schedule+0x285/0x840
Jul 31 16:32:41 teststorage [ 369.808528] schedule+0x36/0x80
Jul 31 16:32:41 teststorage [ 369.808531] schedule_timeout+0x242/0x2f0
Jul 31 16:32:41 teststorage [ 369.808537] ? ttwu_do_activate+0x77/0x80
Jul 31 16:32:41 teststorage [ 369.808540] wait_for_completion+0x111/0x170
Jul 31 16:32:41 teststorage [ 369.808544] ? wake_up_q+0x80/0x80
Jul 31 16:32:41 teststorage [ 369.808555] target_wait_for_sess_cmds+0x49/0x170 [target_core_mod]
Jul 31 16:32:41 teststorage [ 369.808560] tcm_qla2xxx_free_session+0x45/0x90 [tcm_qla2xxx]
Jul 31 16:32:41 teststorage [ 369.808572] qlt_free_session_done+0x109/0x5a0 [qla2xxx]
Jul 31 16:32:41 teststorage [ 369.808579] ? qlt_unreg_sess+0xe7/0xf0 [qla2xxx]
Jul 31 16:32:41 teststorage [ 369.808583] ? tcm_qla2xxx_put_sess+0x3c/0x50 [tcm_qla2xxx]
Jul 31 16:32:41 teststorage [ 369.808587] process_one_work+0x144/0x370
Jul 31 16:32:41 teststorage [ 369.808591] worker_thread+0x4d/0x3c0
Jul 31 16:32:41 teststorage [ 369.808594] kthread+0x109/0x140
Jul 31 16:32:41 teststorage [ 369.808597] ? rescuer_thread+0x360/0x360
Jul 31 16:32:41 teststorage [ 369.808600] ? kthread_park+0x60/0x60
Jul 31 16:32:41 teststorage [ 369.808720] ret_from_fork+0x25/0x30
Jul 31 16:32:41 teststorage [ 369.809016] NMI backtrace for cpu 3
Jul 31 16:32:41 teststorage [ 369.809446] CPU: 3 PID: 107 Comm: khungtaskd Not tainted 4.12.3p1 #3
Jul 31 16:32:41 teststorage [ 369.809886] Hardware name: ASUS All Series/X99-E WS/USB 3.1, BIOS 3402 11/14/2016
Jul 31 16:32:41 teststorage [ 369.810326] Call Trace:
Jul 31 16:32:41 teststorage [ 369.810764] dump_stack+0x63/0x8e
Jul 31 16:32:41 teststorage [ 369.811190] nmi_cpu_backtrace+0x9b/0xa0
Jul 31 16:32:41 teststorage [ 369.811613] ? irq_force_complete_move+0xf0/0xf0
Jul 31 16:32:41 teststorage [ 369.812036] nmi_trigger_cpumask_backtrace+0x99/0xd0
Jul 31 16:32:41 teststorage [ 369.812461] arch_trigger_cpumask_backtrace+0x19/0x20
Jul 31 16:32:41 teststorage [ 369.812890] watchdog+0x2c7/0x340
Jul 31 16:32:41 teststorage [ 369.813321] kthread+0x109/0x140
Jul 31 16:32:41 teststorage [ 369.813748] ? reset_hung_task_detector+0x20/0x20
Jul 31 16:32:41 teststorage [ 369.814171] ? kthread_park+0x60/0x60
Jul 31 16:32:41 teststorage [ 369.814595] ? kthread_park+0x60/0x60
Jul 31 16:32:41 teststorage [ 369.815013] ret_from_fork+0x25/0x30
Jul 31 16:32:41 teststorage [ 369.815428] Sending NMI from CPU 3 to CPUs 0-2,4-11:
Jul 31 16:32:41 teststorage [ 369.815846] NMI backtrace for cpu 0 skipped: idling at pc 0xffffffff818b5ac6
Jul 31 16:32:41 teststorage [ 369.815946] NMI backtrace for cpu 1 skipped: idling at pc 0xffffffff818b5ac6
Jul 31 16:32:41 teststorage [ 369.816045] NMI backtrace for cpu 2 skipped: idling at pc 0xffffffff818b5ac6
Jul 31 16:32:41 teststorage [ 369.816146] NMI backtrace for cpu 4 skipped: idling at pc 0xffffffff818b5ac6
Jul 31 16:32:41 teststorage [ 369.816247] NMI backtrace for cpu 5 skipped: idling at pc 0xffffffff818b5ac6
Jul 31 16:32:41 teststorage [ 369.816352] NMI backtrace for cpu 6 skipped: idling at pc 0xffffffff818b5ac6
Jul 31 16:32:41 teststorage [ 369.816446] NMI backtrace for cpu 7 skipped: idling at pc 0xffffffff818b5ac6
Jul 31 16:32:41 teststorage [ 369.816547] NMI backtrace for cpu 8 skipped: idling at pc 0xffffffff818b5ac6
Jul 31 16:32:41 teststorage [ 369.816648] NMI backtrace for cpu 9 skipped: idling at pc 0xffffffff818b5ac6
Jul 31 16:32:41 teststorage [ 369.816747] NMI backtrace for cpu 10 skipped: idling at pc 0xffffffff818b5ac6
Jul 31 16:32:41 teststorage [ 369.816851] NMI backtrace for cpu 11 skipped: idling at pc 0xffffffff818b5ac6
Jul 31 16:32:41 teststorage [ 369.816855] Kernel panic - not syncing: hung_task: blocked tasks
Jul 31 16:32:41 teststorage [ 369.816857] CPU: 3 PID: 107 Comm: khungtaskd Not tainted 4.12.3p1 #3
Jul 31 16:32:41 teststorage [ 369.816857] Hardware name: ASUS All Series/X99-E WS/USB 3.1, BIOS 3402 11/14/2016
Jul 31 16:32:41 teststorage [ 369.816857] Call Trace:
Jul 31 16:32:41 teststorage [ 369.816859] dump_stack+0x63/0x8e
Jul 31 16:32:41 teststorage [ 369.816861] panic+0xeb/0x239
Jul 31 16:32:41 teststorage [ 369.816863] watchdog+0x2d5/0x340
Jul 31 16:32:41 teststorage [ 369.816864] kthread+0x109/0x140
Jul 31 16:32:41 teststorage [ 369.816864] ? reset_hung_task_detector+0x20/0x20
Jul 31 16:32:41 teststorage [ 369.816865] ? kthread_park+0x60/0x60
Jul 31 16:32:41 teststorage [ 369.816866] ? kthread_park+0x60/0x60
Jul 31 16:32:41 teststorage [ 369.816866] ret_from_fork+0x25/0x30

8<=====

> Also, some more details on what your backend device storage behind
> target-core looks like would be helpful.
>
> Physical HBAs, device information, and any MD/RAID/LVM/etc on top..
>
I have an Adaptec 7805 HW RAID (6x SAS 600G 15k HDD in RAID0) exporting 1
device to the system (sdc); LVM sits on this device, and an LV (vgtest/test)
is used as the LIO block backstore and is exported to the ESXs.
(For the system I have 2 separate SATA HDDs in software RAID1 + LVM.)

[root@TESTSTORAGE ~]# lsblk
NAME                MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sdb                   8:16   0 465,8G  0 disk
└─md126               9:126  0 465,8G  0 raid1
  ├─md126p3         259:2    0   314G  0 md
  │ ├─vgsystem-swap 253:1    0    64G  0 lvm
  │ ├─vgsystem-root 253:0    0   100G  0 lvm   /
  │ └─vgsystem-home 253:3    0   200G  0 lvm   /home
  ├─md126p1         259:0    0   200M  0 md    /boot/efi
  ├─md126p4         259:3    0 150,6G  0 md
  │ └─vgsystem-root 253:0    0   100G  0 lvm   /
  └─md126p2         259:1    0     1G  0 md    /boot
sdc                   8:32   1   3,3T  0 disk
└─vgtest-test       253:2    0   3,3T  0 lvm
sda                   8:0    0 465,8G  0 disk
└─md126               9:126  0 465,8G  0 raid1
  ├─md126p3         259:2    0   314G  0 md
  │ ├─vgsystem-swap 253:1    0    64G  0 lvm
  │ ├─vgsystem-root 253:0    0   100G  0 lvm   /
  │ └─vgsystem-home 253:3    0   200G  0 lvm   /home
  ├─md126p1         259:0    0   200M  0 md    /boot/efi
  ├─md126p4         259:3    0 150,6G  0 md
  │ └─vgsystem-root 253:0    0   100G  0 lvm   /
  └─md126p2         259:1    0     1G  0 md    /boot

8<=========
Hi Lukasz,

(Adding Pascal CC', as he's been seeing something similar with MD
backend storage)

Thanks a lot for the follow-up. Comments below.

On Mon, 2017-07-31 at 17:15 +0200, Łukasz Engel wrote:
> Hi Nicholas,
>
> On 31.07.2017 at 02:00, Nicholas A. Bellinger wrote:
> > Hi Lukasz,
> >
> > On Thu, 2017-07-27 at 17:23 +0200, Łukasz Engel wrote:
> >> Hello
> >>
> >> I am experiencing problems using LIO FC target with vmware ESX (6.5).
> >>
> >> This is the same problem as discussed in
> >> https://www.spinics.net/lists/target-devel/msg15670.html
> >> I've tried 4.11.x (up to 12) and 4.12.3 (and also some kernels from the 4.9
> >> line).
> [...]
> >> Yes, this is most certainly the same bug the patch here addresses:
> >>
> >> https://www.spinics.net/lists/target-devel/msg15759.html
> >>
> >> Also, a smaller version of that just went out with your CC' on it here:
> >>
> >> qla2xxx: Fix incorrect tcm_qla2xxx_free_cmd use during TMR ABORT (v2)
> >> https://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending.git/commit/?id=6bcbb3174caa5f1ccc894f8ae077631659d5a629
> After applying both patches (for 6bcbb3174caa5f1ccc894f8ae077631659d5a629,
> manual adjusting of the second hunk was required for 4.12.x - the context
> after the match is slightly different), there is no BUG, but a hang in a
> kworker instead, with all I/O from FC frozen:
>
> Jul 31 16:27:14 teststorage [ 42.897291] qla2xxx [0000:06:00.0]-500a:12: LOOP UP detected (4 Gbps).
> Jul 31 16:27:16 teststorage [ 42.959068] qla2xxx [0000:09:00.0]-ffff:11: qla24xx_do_nack_work create sess success ffff881f80e29200
> Jul 31 16:27:16 teststorage [ 44.144811] qla2xxx [0000:06:00.0]-ffff:12: qla24xx_do_nack_work create sess success ffff881f80e28400
> Jul 31 16:27:16 teststorage [ 44.230770] qla2xxx [0000:06:00.0]-ffff:12: qla24xx_do_nack_work create sess success ffff881f8f575a00
> Jul 31 16:27:16 teststorage [ 44.473650] qla2xxx [0000:06:00.0]-ffff:12: qla24xx_do_nack_work create sess success ffff881f8f575600
> Jul 31 16:27:16 teststorage [ 44.520821] qla2xxx [0000:06:00.0]-ffff:12: qla24xx_do_nack_work create sess success ffff881f8f574400
> Jul 31 16:27:16 teststorage [ 44.538183] qla2xxx [0000:06:00.0]-ffff:12: qla24xx_do_nack_work create sess success ffff881f900f3400
> Jul 31 16:27:16 teststorage [ 44.984052] qla2xxx [0000:06:00.0]-ffff:12: qla24xx_do_nack_work create sess success ffff881f8f577a00
> Jul 31 16:28:54 teststorage [ 45.042585] qla2xxx [0000:06:00.0]-ffff:12: qla24xx_do_nack_work create sess success ffff881f80e28000
> Jul 31 16:28:54 teststorage [ 142.431045] ABORT_TASK: Found referenced qla2xxx task_tag: 1189328
> Jul 31 16:28:54 teststorage [ 142.431059] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1189328
> Jul 31 16:28:57 teststorage [ 145.632571] ABORT_TASK: Found referenced qla2xxx task_tag: 1177932
> Jul 31 16:28:57 teststorage [ 145.897299] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 1177932
> Jul 31 16:29:00 teststorage [ 148.388569] ABORT_TASK: Found referenced qla2xxx task_tag: 1164160
> Jul 31 16:32:00 teststorage [ 328.848344] se_cmd waiting CDB: 0x89 tag: 1164160 t_state: 5 se_cmd_flags: 0x01c80109 transport_state: 0x00000033

Perfect. This is very helpful to understand the state of the I/O triggering
the ESX host ABORTs, and that it blocks waiting for completion from backend
storage.
Here's what the above debug translates to:

CDB: 0x89 = COMPARE_AND_WRITE (e.g. VAAI Atomic Test and Set)
t_state: 5 = TRANSPORT_PROCESSING

se_cmd_flags: 0x01c80109 =

    SCF_SUPPORTED_SAM_OPCODE | SCF_SCSI_DATA_CDB | SCF_SE_LUN_CMD |
    SCF_COMPARE_AND_WRITE | SCF_ACK_KREF | SCF_USE_CPUID | SCF_TASK_ATTR_SET

transport_state: 0x00000033 =

    CMD_T_ABORTED | CMD_T_ACTIVE | CMD_T_SENT | CMD_T_STOP

Note that COMPARE_AND_WRITE, with respect to backend I/O, consists of two
phases. The first phase is a READ and a comparison against the first part of
the received buffer. If the comparison matches, the second phase sends a
WRITE with the second part of the received buffer to complete the update.

So according to the above output, the first phase of COMPARE_AND_WRITE
dispatched the READ into backend storage. The READ did not complete, and
subsequently ESX sent an ABORT_TASK for it (tag: 1164160). The ABORT_TASK
successfully found the I/O still outstanding to backend storage, and blocks
waiting for it to complete..

The debug printing the 'se_cmd waiting ..' message shows every 180 seconds,
so it's clear I/O completion from backend storage is not happening back to
IBLOCK -> target-core for an extended amount of time.

Btw, I've been simulating this same scenario by delaying backend storage
I/O completion, and thus far, once the first phase READ of COMPARE_AND_WRITE
eventually returns back to IBLOCK -> target-core, ABORT_TASK is able to
complete and everything else continues to function as expected.

That said, thus far the evidence is pointing to a backend device issue,
where that first READ phase of the COMPARE_AND_WRITE is (never..?)
completed back to IBLOCK -> target-core. More below on how to verify this.

> Jul 31 16:32:41 teststorage [ 369.808484] INFO: task kworker/1:93:2802 blocked for more than 120 seconds.
> Jul 31 16:32:41 teststorage [ 369.808493] Not tainted 4.12.3p1 #3
> Jul 31 16:32:41 teststorage [ 369.808496] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Jul 31 16:32:41 teststorage [ 369.808499] kworker/1:93 D 0 2802 2 0x00000080
> Jul 31 16:32:41 teststorage [ 369.808515] Workqueue: events qlt_free_session_done [qla2xxx]
> Jul 31 16:32:41 teststorage [ 369.808518] Call Trace:
> Jul 31 16:32:41 teststorage [ 369.808524] __schedule+0x285/0x840
> Jul 31 16:32:41 teststorage [ 369.808528] schedule+0x36/0x80
> Jul 31 16:32:41 teststorage [ 369.808531] schedule_timeout+0x242/0x2f0
> Jul 31 16:32:41 teststorage [ 369.808537] ? ttwu_do_activate+0x77/0x80
> Jul 31 16:32:41 teststorage [ 369.808540] wait_for_completion+0x111/0x170
> Jul 31 16:32:41 teststorage [ 369.808544] ? wake_up_q+0x80/0x80
> Jul 31 16:32:41 teststorage [ 369.808555] target_wait_for_sess_cmds+0x49/0x170 [target_core_mod]
> Jul 31 16:32:41 teststorage [ 369.808560] tcm_qla2xxx_free_session+0x45/0x90 [tcm_qla2xxx]
> Jul 31 16:32:41 teststorage [ 369.808572] qlt_free_session_done+0x109/0x5a0 [qla2xxx]
> Jul 31 16:32:41 teststorage [ 369.808579] ? qlt_unreg_sess+0xe7/0xf0 [qla2xxx]
> Jul 31 16:32:41 teststorage [ 369.808583] ? tcm_qla2xxx_put_sess+0x3c/0x50 [tcm_qla2xxx]
> Jul 31 16:32:41 teststorage [ 369.808587] process_one_work+0x144/0x370
> Jul 31 16:32:41 teststorage [ 369.808591] worker_thread+0x4d/0x3c0
> Jul 31 16:32:41 teststorage [ 369.808594] kthread+0x109/0x140
> Jul 31 16:32:41 teststorage [ 369.808597] ? rescuer_thread+0x360/0x360
> Jul 31 16:32:41 teststorage [ 369.808600] ? kthread_park+0x60/0x60
> Jul 31 16:32:41 teststorage [ 369.808720] ret_from_fork+0x25/0x30
> Jul 31 16:32:41 teststorage [ 369.809016] NMI backtrace for cpu 3
> Jul 31 16:32:41 teststorage [ 369.809446] CPU: 3 PID: 107 Comm: khungtaskd Not tainted 4.12.3p1 #3
> Jul 31 16:32:41 teststorage [ 369.809886] Hardware name: ASUS All Series/X99-E WS/USB 3.1, BIOS 3402 11/14/2016
> Jul 31 16:32:41 teststorage [ 369.810326] Call Trace:
> Jul 31 16:32:41 teststorage [ 369.810764] dump_stack+0x63/0x8e
> Jul 31 16:32:41 teststorage [ 369.811190] nmi_cpu_backtrace+0x9b/0xa0
> Jul 31 16:32:41 teststorage [ 369.811613] ? irq_force_complete_move+0xf0/0xf0
> Jul 31 16:32:41 teststorage [ 369.812036] nmi_trigger_cpumask_backtrace+0x99/0xd0
> Jul 31 16:32:41 teststorage [ 369.812461] arch_trigger_cpumask_backtrace+0x19/0x20
> Jul 31 16:32:41 teststorage [ 369.812890] watchdog+0x2c7/0x340
> Jul 31 16:32:41 teststorage [ 369.813321] kthread+0x109/0x140
> Jul 31 16:32:41 teststorage [ 369.813748] ? reset_hung_task_detector+0x20/0x20
> Jul 31 16:32:41 teststorage [ 369.814171] ? kthread_park+0x60/0x60
> Jul 31 16:32:41 teststorage [ 369.814595] ? kthread_park+0x60/0x60
> Jul 31 16:32:41 teststorage [ 369.815013] ret_from_fork+0x25/0x30
> Jul 31 16:32:41 teststorage [ 369.815428] Sending NMI from CPU 3 to CPUs 0-2,4-11:
> Jul 31 16:32:41 teststorage [ 369.815846] NMI backtrace for cpu 0 skipped: idling at pc 0xffffffff818b5ac6
> Jul 31 16:32:41 teststorage [ 369.815946] NMI backtrace for cpu 1 skipped: idling at pc 0xffffffff818b5ac6
> Jul 31 16:32:41 teststorage [ 369.816045] NMI backtrace for cpu 2 skipped: idling at pc 0xffffffff818b5ac6
> Jul 31 16:32:41 teststorage [ 369.816146] NMI backtrace for cpu 4 skipped: idling at pc 0xffffffff818b5ac6
> Jul 31 16:32:41 teststorage [ 369.816247] NMI backtrace for cpu 5 skipped: idling at pc 0xffffffff818b5ac6
> Jul 31 16:32:41 teststorage [ 369.816352] NMI backtrace for cpu 6 skipped: idling at pc 0xffffffff818b5ac6
> Jul 31 16:32:41 teststorage [ 369.816446] NMI backtrace for cpu 7 skipped: idling at pc 0xffffffff818b5ac6
> Jul 31 16:32:41 teststorage [ 369.816547] NMI backtrace for cpu 8 skipped: idling at pc 0xffffffff818b5ac6
> Jul 31 16:32:41 teststorage [ 369.816648] NMI backtrace for cpu 9 skipped: idling at pc 0xffffffff818b5ac6
> Jul 31 16:32:41 teststorage [ 369.816747] NMI backtrace for cpu 10 skipped: idling at pc 0xffffffff818b5ac6
> Jul 31 16:32:41 teststorage [ 369.816851] NMI backtrace for cpu 11 skipped: idling at pc 0xffffffff818b5ac6
> Jul 31 16:32:41 teststorage [ 369.816855] Kernel panic - not syncing: hung_task: blocked tasks
> Jul 31 16:32:41 teststorage [ 369.816857] CPU: 3 PID: 107 Comm: khungtaskd Not tainted 4.12.3p1 #3
> Jul 31 16:32:41 teststorage [ 369.816857] Hardware name: ASUS All Series/X99-E WS/USB 3.1, BIOS 3402 11/14/2016
> Jul 31 16:32:41 teststorage [ 369.816857] Call Trace:
> Jul 31 16:32:41 teststorage [ 369.816859] dump_stack+0x63/0x8e
> Jul 31 16:32:41 teststorage [ 369.816861] panic+0xeb/0x239
> Jul 31 16:32:41 teststorage [ 369.816863] watchdog+0x2d5/0x340
> Jul 31 16:32:41 teststorage [ 369.816864] kthread+0x109/0x140
> Jul 31 16:32:41 teststorage [ 369.816864] ? reset_hung_task_detector+0x20/0x20
> Jul 31 16:32:41 teststorage [ 369.816865] ? kthread_park+0x60/0x60
> Jul 31 16:32:41 teststorage [ 369.816866] ? kthread_park+0x60/0x60
> Jul 31 16:32:41 teststorage [ 369.816866] ret_from_fork+0x25/0x30
>

Btw, it looks like this system is running with:

CONFIG_BOOTPARAM_HUNG_TASK_PANIC_VALUE=1

which triggers a crash as soon as any hung task is detected.

I'd recommend setting this to '0' for now, to monitor what's going on while
hung tasks are occurring. Also, this can be set at boot time via
hung_task_panic=0.

> 8<=====
>
> > Also, some more details on what your backend device storage behind
> > target-core looks like would be helpful.
> >
> > Physical HBAs, device information, and any MD/RAID/LVM/etc on top..
> >
> I have an Adaptec 7805 HW RAID (6x SAS 600G 15k HDD in RAID0) exporting 1
> device to the system (sdc); LVM sits on this device, and an LV (vgtest/test)
> is used as the LIO block backstore and is exported to the ESXs.
> (For the system I have 2 separate SATA HDDs in software RAID1 + LVM.)
>
> [root@TESTSTORAGE ~]# lsblk
> NAME                MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
> sdb                   8:16   0 465,8G  0 disk
> └─md126               9:126  0 465,8G  0 raid1
>   ├─md126p3         259:2    0   314G  0 md
>   │ ├─vgsystem-swap 253:1    0    64G  0 lvm
>   │ ├─vgsystem-root 253:0    0   100G  0 lvm   /
>   │ └─vgsystem-home 253:3    0   200G  0 lvm   /home
>   ├─md126p1         259:0    0   200M  0 md    /boot/efi
>   ├─md126p4         259:3    0 150,6G  0 md
>   │ └─vgsystem-root 253:0    0   100G  0 lvm   /
>   └─md126p2         259:1    0     1G  0 md    /boot
> sdc                   8:32   1   3,3T  0 disk
> └─vgtest-test       253:2    0   3,3T  0 lvm
> sda                   8:0    0 465,8G  0 disk
> └─md126               9:126  0 465,8G  0 raid1
>   ├─md126p3         259:2    0   314G  0 md
>   │ ├─vgsystem-swap 253:1    0    64G  0 lvm
>   │ ├─vgsystem-root 253:0    0   100G  0 lvm   /
>   │ └─vgsystem-home 253:3    0   200G  0 lvm   /home
>   ├─md126p1         259:0    0   200M  0 md    /boot/efi
>   ├─md126p4         259:3    0 150,6G  0 md
>   │ └─vgsystem-root 253:0    0   100G  0 lvm   /
>   └─md126p2         259:1    0     1G  0 md    /boot
>
> 8<=========
>

OK, since the vgtest-test LVM is backed by a single SCSI device (sdc), there
is a simple test to see if the SCSI device is holding onto I/Os.

Once the hung task triggers, check out:

cat /sys/block/sdc/inflight

This will show how many I/Os are outstanding to the underlying SCSI device.

IIRC, md and lvm devices don't use 'inflight' counters, so this is only
useful for verifying the number of I/Os outstanding to the underlying SCSI
device. So if it's possible, I'd recommend verifying with the raw SCSI
device as a BLOCK export as well, to isolate whether it's something specific
to LVM that's causing I/Os not to be completed.

Thanks again for your detailed bug report.

--
To unsubscribe from this list: send the line "unsubscribe target-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
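The inflight file mentioned above holds two counters, requests in flight for reads and for writes. A minimal sketch of reading and interpreting it; the helper names (`parse_inflight`, `device_inflight`) are hypothetical, not anything from the thread, and only the two-column sysfs format is assumed:

```python
def parse_inflight(text: str) -> tuple[int, int]:
    """Parse the two-column inflight format, e.g. '       0        3'.

    The first column is reads in flight, the second is writes in flight.
    """
    reads, writes = (int(field) for field in text.split())
    return reads, writes


def device_inflight(dev: str) -> tuple[int, int]:
    # e.g. dev = "sdc" reads /sys/block/sdc/inflight
    with open(f"/sys/block/{dev}/inflight") as f:
        return parse_inflight(f.read())


if __name__ == "__main__":
    # "0 0" in the final hung state means no request is outstanding
    # to the SCSI device, i.e. the backend is not holding the I/O.
    print(parse_inflight("       0        0"))  # -> (0, 0)
```

A "0 0" result while 'se_cmd waiting' keeps printing is exactly the observation Łukasz reports later in the thread: the block device has nothing in flight, yet target-core still waits.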
Hi Nicholas,

On 07.08.2017 at 00:28, Nicholas A. Bellinger wrote:
> Hi Lukasz,
>
> (Adding Pascal CC', as he's been seeing something similar with MD
> backend storage)
>
> Thanks a lot for the follow-up. Comments below.
>
> On Mon, 2017-07-31 at 17:15 +0200, Łukasz Engel wrote:
>> Hi Nicholas,
>>
>> On 31.07.2017 at 02:00, Nicholas A. Bellinger wrote:
[...]
>> Jul 31 16:28:54 teststorage [ 142.431045] ABORT_TASK: Found referenced qla2xxx task_tag: 1189328
>> Jul 31 16:28:54 teststorage [ 142.431059] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1189328
>> Jul 31 16:28:57 teststorage [ 145.632571] ABORT_TASK: Found referenced qla2xxx task_tag: 1177932
>> Jul 31 16:28:57 teststorage [ 145.897299] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 1177932
>> Jul 31 16:29:00 teststorage [ 148.388569] ABORT_TASK: Found referenced qla2xxx task_tag: 1164160
>> Jul 31 16:32:00 teststorage [ 328.848344] se_cmd waiting CDB: 0x89 tag: 1164160 t_state: 5 se_cmd_flags: 0x01c80109 transport_state: 0x00000033
> Perfect. This is very helpful to understand the state of the I/O triggering
> the ESX host ABORTs, and that it blocks waiting for completion from backend
> storage.
>
> Here's what the above debug translates to:
>
> CDB: 0x89 = COMPARE_AND_WRITE (e.g. VAAI Atomic Test and Set)
> t_state: 5 = TRANSPORT_PROCESSING
>
> se_cmd_flags: 0x01c80109 =
>
> SCF_SUPPORTED_SAM_OPCODE | SCF_SCSI_DATA_CDB | SCF_SE_LUN_CMD |
> SCF_COMPARE_AND_WRITE | SCF_ACK_KREF | SCF_USE_CPUID | SCF_TASK_ATTR_SET
>
> transport_state: 0x00000033 =
>
> CMD_T_ABORTED | CMD_T_ACTIVE | CMD_T_SENT | CMD_T_STOP
>
> Note that COMPARE_AND_WRITE, with respect to backend I/O, consists of two
> phases. The first phase is a READ and a comparison against the first part
> of the received buffer. If the comparison matches, the second phase sends
> a WRITE with the second part of the received buffer to complete the update.
>
> So according to the above output, the first phase of COMPARE_AND_WRITE
> dispatched the READ into backend storage. The READ did not complete,
> and subsequently ESX sent an ABORT_TASK for it (tag: 1164160). The
> ABORT_TASK successfully found the I/O still outstanding to backend
> storage, and blocks waiting for it to complete..
>
> The debug printing the 'se_cmd waiting ..' message shows every 180 seconds,
> so it's clear I/O completion from backend storage is not happening back
> to IBLOCK -> target-core for an extended amount of time.
>
> Btw, I've been simulating this same scenario by delaying backend storage
> I/O completion, and thus far, once the first phase READ of
> COMPARE_AND_WRITE eventually returns back to IBLOCK -> target-core,
> ABORT_TASK is able to complete and everything else continues to function
> as expected.
>
> That said, thus far the evidence is pointing to a backend device issue,
> where that first READ phase of the COMPARE_AND_WRITE is (never..?)
> completed back to IBLOCK -> target-core.
>
> More below on how to verify this.
[...]
> Btw, it looks like this system is running with:
>
> CONFIG_BOOTPARAM_HUNG_TASK_PANIC_VALUE=1
>
> which triggers a crash as soon as any hung task is detected.
>
> I'd recommend setting this to '0' for now, to monitor what's going on
> while hung tasks are occurring.
>
> Also, this can be set at boot time via hung_task_panic=0.

Yes, that was planned. Of course, for checking the "after hung-task"
situation I've disabled this now...

>
>>> Also, some more details on what your backend device storage behind
>>> target-core looks like would be helpful.
>>>
>>> Physical HBAs, device information, and any MD/RAID/LVM/etc on top..
>>>
>> I have an Adaptec 7805 HW RAID (6x SAS 600G 15k HDD in RAID0) exporting 1
>> device to the system (sdc); LVM sits on this device, and an LV (vgtest/test)
>> is used as the LIO block backstore and is exported to the ESXs.
>> (For the system I have 2 separate SATA HDDs in software RAID1 + LVM.)
[...]
>> OK, since the vgtest-test LVM is backed by a single SCSI device (sdc),
>> there is a simple test to see if the SCSI device is holding onto I/Os.
>>
>> Once the hung task triggers, check out:
>>
>> cat /sys/block/sdc/inflight
>>
>> This will show how many I/Os are outstanding to the underlying SCSI
>> device.
>>
>> IIRC, md and lvm devices don't use 'inflight' counters, so this is only
>> useful for verifying the number of I/Os outstanding to the underlying
>> SCSI device. So if it's possible, I'd recommend verifying with the raw
>> SCSI device as a BLOCK export as well, to isolate whether it's something
>> specific to LVM that's causing I/Os not to be completed.
>>
>> Thanks again for your detailed bug report.
>>

I re-ran my test without LVM (sdc directly exported as the target's
backstore). The result is the same as with LVM - a kworker blocked waiting
for I/O completion, with an identical "se_cmd waiting" report (excluding
the tag value, of course). (My personal feeling is that it took a little
longer to block than with LVM.)

I checked the inflight counters for sdc - there were some non-zero values
during I/O activity (as expected), and in the final state (when "se_cmd
waiting ..." or the hung task is reported) there are zeros there.

Tomorrow I plan to retest using a different SAS controller which I have on
the shelf (LSI). We will see if it's reproducible in that configuration,
but as Pascal got this on quite a different setup than mine, I expect to
get similar results.

Personally (if I properly understand your description) it looks to me like
in some situations the I/O completion notification for the read part of the
TAS operation "got lost" somewhere in the block/target layer. (And then the
timeout for the operation kicks in on the ESX side, and it sends ABORT_TASK
for this operation.)

(BTW - I saw some non-zero values also for the dm-* inflight counters
during I/O activity, so it looks like LVM currently (4.12) does provide
inflight stats.)

I will let you know about my findings tomorrow.
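The two-phase COMPARE_AND_WRITE semantics discussed back and forth above can be sketched in a few lines. This is an illustration of the semantics only (plain Python over a bytearray, with hypothetical names), not LIO's implementation:

```python
MISCOMPARE = "MISCOMPARE"
GOOD = "GOOD"


def compare_and_write(backend: bytearray, offset: int,
                      verify: bytes, data: bytes) -> str:
    """Two-phase compare-and-write over an in-memory 'backend'."""
    assert len(verify) == len(data)
    # Phase 1: READ from the backend and compare against the verify buffer.
    # In the hang described in the thread, this read never completes, so the
    # command stays in TRANSPORT_PROCESSING and ABORT_TASK blocks on it.
    current = bytes(backend[offset:offset + len(verify)])
    if current != verify:
        return MISCOMPARE  # -> MISCOMPARE check condition + sense data
    # Phase 2: WRITE the new data to complete the atomic update.
    backend[offset:offset + len(data)] = data
    return GOOD


if __name__ == "__main__":
    disk = bytearray(b"old!" + b"\x00" * 4)
    print(compare_and_write(disk, 0, b"old!", b"new!"))  # GOOD, disk updated
    print(compare_and_write(disk, 0, b"old!", b"xxx!"))  # MISCOMPARE now
```

The MISCOMPARE path here corresponds to the "Detected MISCOMPARE ... Send MISCOMPARE check condition and sense" messages that appear in the next email: a failed phase-1 comparison is a normal outcome of VAAI ATS contention, not itself an error in the target.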
Hi Nicholas,

On 08.08.2017 at 23:00, Łukasz Engel wrote:
> Hi Nicholas,
>
[...]
> Tomorrow I plan to retest using a different SAS controller which I have
> on the shelf (LSI).
> We will see if it's reproducible in that configuration, but as Pascal
> got this on quite a different setup than mine, I expect to get similar
> results.
>
> Personally (if I properly understand your description) it looks to me
> like in some situations the I/O completion notification for the read
> part of the TAS operation "got lost" somewhere in the block/target
> layer. (And then the timeout for the operation kicks in on the ESX side,
> and it sends ABORT_TASK for this operation.)
>
> (BTW - I saw some non-zero values also for the dm-* inflight counters
> during I/O activity, so it looks like LVM currently (4.12) does provide
> inflight stats.)
>
> I will let you know about my findings tomorrow.
>

With the LSI controller (HW RAID0 built on the same 6 HDDs), without LVM -
same results:

Aug  9 15:49:32 TESTSTORAGE kernel: Detected MISCOMPARE for addr: ffff881f901e0000 buf: ffff881f83c5c800
Aug  9 15:49:32 TESTSTORAGE kernel: Target/iblock: Send MISCOMPARE check condition and sense
Aug  9 15:49:32 TESTSTORAGE kernel: Detected MISCOMPARE for addr: ffff881f83e57000 buf: ffff881f83c5c800
Aug  9 15:49:32 TESTSTORAGE kernel: Target/iblock: Send MISCOMPARE check condition and sense
Aug  9 15:49:32 TESTSTORAGE kernel: Detected MISCOMPARE for addr: ffff881f7ee51000 buf: ffff881f83c5c800
Aug  9 15:49:32 TESTSTORAGE kernel: Target/iblock: Send MISCOMPARE check condition and sense
Aug  9 15:49:32 TESTSTORAGE kernel: Detected MISCOMPARE for addr: ffff881f81745000 buf: ffff881f83c5c800
Aug  9 15:49:32 TESTSTORAGE kernel: Target/iblock: Send MISCOMPARE check condition and sense
Aug  9 15:49:33 TESTSTORAGE kernel: Detected MISCOMPARE for addr: ffff881f7ef99000 buf: ffff881f83c5c800
Aug  9 15:49:33 TESTSTORAGE kernel: Target/iblock: Send MISCOMPARE check condition and sense
Aug  9 15:49:33 TESTSTORAGE kernel: Detected MISCOMPARE for addr: ffff881f7e56a000 buf: ffff881f83c5c800
Aug  9 15:49:33 TESTSTORAGE kernel: Target/iblock: Send MISCOMPARE check condition and sense
Aug  9 15:49:33 TESTSTORAGE kernel: Detected MISCOMPARE for addr: ffff881f7edf1000 buf: ffff881f83c5c800
Aug  9 15:49:33 TESTSTORAGE kernel: Target/iblock: Send MISCOMPARE check condition and sense
Aug  9 15:49:33 TESTSTORAGE kernel: Detected MISCOMPARE for addr: ffff881f8faaa000 buf: ffff881f83c5c800
Aug  9 15:49:33 TESTSTORAGE kernel: Target/iblock: Send MISCOMPARE check condition and sense
Aug  9 15:49:37 TESTSTORAGE kernel: Detected MISCOMPARE for addr: ffff881f815d1000 buf: ffff881f83c5fa00
Aug  9 15:49:37 TESTSTORAGE kernel: Target/iblock: Send MISCOMPARE check condition and sense
Aug  9 15:49:37 TESTSTORAGE kernel: Detected MISCOMPARE for addr: ffff881f8146e000 buf: ffff881f83c5f200
Aug  9 15:49:37 TESTSTORAGE kernel: Target/iblock: Send MISCOMPARE check condition and sense
Aug  9 15:49:37 TESTSTORAGE kernel: ABORT_TASK: Found referenced qla2xxx task_tag: 1138156
Aug  9 15:49:37 TESTSTORAGE kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1138156
Aug  9 15:49:37 TESTSTORAGE kernel: ABORT_TASK: Found referenced qla2xxx task_tag: 1138200
Aug  9 15:49:37 TESTSTORAGE kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1138200
Aug  9 15:49:50 TESTSTORAGE kernel: ABORT_TASK: Found referenced qla2xxx task_tag: 1193992
Aug  9 15:49:52 TESTSTORAGE kernel: ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 1193992
Aug  9 15:49:52 TESTSTORAGE kernel: ABORT_TASK: Found referenced qla2xxx task_tag: 1194036
Aug  9 15:49:57 TESTSTORAGE kernel: TARGET_CORE[qla2xxx]: Detected NON_EXISTENT_LUN Access for 0x00000001
Aug  9 15:49:57 TESTSTORAGE kernel: TARGET_CORE[qla2xxx]: Detected NON_EXISTENT_LUN Access for 0x00000002
[...] <probably a periodic LUN scan issued by ESX>
Aug  9 15:49:57 TESTSTORAGE kernel: TARGET_CORE[qla2xxx]: Detected NON_EXISTENT_LUN Access for 0x000000ff
Aug  9 15:52:09 TESTSTORAGE kernel: INFO: task kworker/u24:0:3001 blocked for more than 120 seconds.
Aug  9 15:52:09 TESTSTORAGE kernel: Not tainted 4.12.3p1 #3
Aug  9 15:52:09 TESTSTORAGE kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug  9 15:52:09 TESTSTORAGE kernel: kworker/u24:0 D 0 3001 2 0x00000080
Aug  9 15:52:09 TESTSTORAGE kernel: Workqueue: tmr-iblock target_tmr_work [target_core_mod]
Aug  9 15:52:09 TESTSTORAGE kernel: Call Trace:
Aug  9 15:52:09 TESTSTORAGE kernel: __schedule+0x285/0x840
Aug  9 15:52:09 TESTSTORAGE kernel: schedule+0x36/0x80
Aug  9 15:52:09 TESTSTORAGE kernel: schedule_timeout+0x173/0x2f0
Aug  9 15:52:09 TESTSTORAGE kernel: ? call_timer_fn+0x140/0x140
Aug  9 15:52:09 TESTSTORAGE kernel: ? get_work_pool+0x2d/0x40
Aug  9 15:52:09 TESTSTORAGE kernel: wait_for_completion_timeout+0x12c/0x190
Aug  9 15:52:09 TESTSTORAGE kernel: ? wake_up_q+0x80/0x80
Aug  9 15:52:09 TESTSTORAGE kernel: __transport_wait_for_tasks+0xd9/0x1b0 [target_core_mod]
Aug  9 15:52:09 TESTSTORAGE kernel: transport_wait_for_tasks+0x53/0x90 [target_core_mod]
Aug  9 15:52:09 TESTSTORAGE kernel: core_tmr_abort_task+0x10e/0x190 [target_core_mod]
Aug  9 15:52:09 TESTSTORAGE kernel: target_tmr_work+0x111/0x120 [target_core_mod]
Aug  9 15:52:09 TESTSTORAGE kernel: process_one_work+0x144/0x370
Aug  9 15:52:09 TESTSTORAGE kernel: worker_thread+0x4d/0x3c0
Aug  9 15:52:09 TESTSTORAGE kernel: kthread+0x109/0x140
Aug  9 15:52:09 TESTSTORAGE kernel: ? rescuer_thread+0x360/0x360
Aug  9 15:52:09 TESTSTORAGE kernel: ? kthread_park+0x60/0x60
Aug  9 15:52:09 TESTSTORAGE kernel: ret_from_fork+0x25/0x30
Aug  9 15:52:54 TESTSTORAGE kernel: se_cmd waiting CDB: 0x89 tag: 1194036 t_state: 5 se_cmd_flags: 0x01c80109 transport_state: 0x00000033

At this point the inflight counters for sdc are "0 0".

Later there are more "hung task" notifications, but they are probably a
result of the first one...

Aug  9 15:54:12 TESTSTORAGE kernel: INFO: task kworker/1:75:3172 blocked for more than 120 seconds.
Aug 9 15:54:12 TESTSTORAGE kernel: Not tainted 4.12.3p1 #3
Aug 9 15:54:12 TESTSTORAGE kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 9 15:54:12 TESTSTORAGE kernel: kworker/1:75 D 0 3172 2 0x00000080
Aug 9 15:54:12 TESTSTORAGE kernel: Workqueue: events qlt_free_session_done [qla2xxx]
Aug 9 15:54:12 TESTSTORAGE kernel: Call Trace:
Aug 9 15:54:12 TESTSTORAGE kernel: __schedule+0x285/0x840
Aug 9 15:54:12 TESTSTORAGE kernel: schedule+0x36/0x80
Aug 9 15:54:12 TESTSTORAGE kernel: schedule_timeout+0x242/0x2f0
Aug 9 15:54:12 TESTSTORAGE kernel: ? ttwu_do_activate+0x77/0x80
Aug 9 15:54:12 TESTSTORAGE kernel: wait_for_completion+0x111/0x170
Aug 9 15:54:12 TESTSTORAGE kernel: ? wake_up_q+0x80/0x80
Aug 9 15:54:12 TESTSTORAGE kernel: target_wait_for_sess_cmds+0x49/0x170 [target_core_mod]
Aug 9 15:54:12 TESTSTORAGE kernel: tcm_qla2xxx_free_session+0x45/0x90 [tcm_qla2xxx]
Aug 9 15:54:12 TESTSTORAGE kernel: qlt_free_session_done+0x109/0x5a0 [qla2xxx]
Aug 9 15:54:12 TESTSTORAGE kernel: ? qlt_unreg_sess+0xe7/0xf0 [qla2xxx]
Aug 9 15:54:12 TESTSTORAGE kernel: ? tcm_qla2xxx_put_sess+0x3c/0x50 [tcm_qla2xxx]
Aug 9 15:54:12 TESTSTORAGE kernel: process_one_work+0x144/0x370
Aug 9 15:54:12 TESTSTORAGE kernel: worker_thread+0x4d/0x3c0
Aug 9 15:54:12 TESTSTORAGE kernel: kthread+0x109/0x140
Aug 9 15:54:12 TESTSTORAGE kernel: ? rescuer_thread+0x360/0x360
Aug 9 15:54:12 TESTSTORAGE kernel: ? kthread_park+0x60/0x60
Aug 9 15:54:12 TESTSTORAGE kernel: ret_from_fork+0x25/0x30
Aug 9 15:54:12 TESTSTORAGE kernel: INFO: task kworker/1:82:3184 blocked for more than 120 seconds.
Aug 9 15:54:12 TESTSTORAGE kernel: Not tainted 4.12.3p1 #3
Aug 9 15:54:12 TESTSTORAGE kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 9 15:54:12 TESTSTORAGE kernel: kworker/1:82 D 0 3184 2 0x00000080
Aug 9 15:54:12 TESTSTORAGE kernel: Workqueue: events qlt_free_session_done [qla2xxx]
Aug 9 15:54:12 TESTSTORAGE kernel: Call Trace:
Aug 9 15:54:12 TESTSTORAGE kernel: __schedule+0x285/0x840
Aug 9 15:54:12 TESTSTORAGE kernel: schedule+0x36/0x80
Aug 9 15:54:12 TESTSTORAGE kernel: schedule_timeout+0x242/0x2f0
Aug 9 15:54:12 TESTSTORAGE kernel: ? ttwu_do_activate+0x77/0x80
Aug 9 15:54:12 TESTSTORAGE kernel: wait_for_completion+0x111/0x170
Aug 9 15:54:12 TESTSTORAGE kernel: ? wake_up_q+0x80/0x80
Aug 9 15:54:12 TESTSTORAGE kernel: target_wait_for_sess_cmds+0x49/0x170 [target_core_mod]
Aug 9 15:54:12 TESTSTORAGE kernel: tcm_qla2xxx_free_session+0x45/0x90 [tcm_qla2xxx]
Aug 9 15:54:12 TESTSTORAGE kernel: qlt_free_session_done+0x109/0x5a0 [qla2xxx]
Aug 9 15:54:12 TESTSTORAGE kernel: ? qlt_unreg_sess+0xe7/0xf0 [qla2xxx]
Aug 9 15:54:12 TESTSTORAGE kernel: ? tcm_qla2xxx_put_sess+0x3c/0x50 [tcm_qla2xxx]
Aug 9 15:54:12 TESTSTORAGE kernel: process_one_work+0x144/0x370
Aug 9 15:54:12 TESTSTORAGE kernel: worker_thread+0x4d/0x3c0
Aug 9 15:54:12 TESTSTORAGE kernel: kthread+0x109/0x140
Aug 9 15:54:12 TESTSTORAGE kernel: ? rescuer_thread+0x360/0x360
Aug 9 15:54:12 TESTSTORAGE kernel: ? kthread_park+0x60/0x60
Aug 9 15:54:12 TESTSTORAGE kernel: ret_from_fork+0x25/0x30
diff --git a/drivers/target/target_core_transport.c b/drivers/target/target_core_transport.c
index 0197635..0bd8c64 100644
--- a/drivers/target/target_core_transport.c
+++ b/drivers/target/target_core_transport.c
@@ -2827,7 +2827,13 @@ void transport_clear_lun_ref(struct se_lun *lun)
 	spin_unlock_irqrestore(&cmd->t_state_lock, *flags);

-	wait_for_completion(&cmd->t_transport_stop_comp);
+	while (!wait_for_completion_timeout(&cmd->t_transport_stop_comp,
+					    180 * HZ)) {
+		pr_warn_ratelimited("se_cmd waiting CDB: 0x%02x tag: %llu t_state: %d"
+			"se_cmd_flags: 0x%08x transport_state: 0x%08x\n",
+			cmd->t_task_cdb[0], cmd->tag, cmd->t_state,
+			cmd->se_cmd_flags, cmd->transport_state);
+	}

 	spin_lock_irqsave(&cmd->t_state_lock, *flags);
 	cmd->transport_state &= ~(CMD_T_ACTIVE | CMD_T_STOP);