kernel BUG at drivers/scsi/qla2xxx/qla_target.c:3319!

Hi Himanshu & Pascal,

On Mon, 2017-06-19 at 17:53 +0000, Madhani, Himanshu wrote:
> Hi Pascal, Nic, 
> 
> Sorry for delay in response. 
> 
> > On Jun 19, 2017, at 12:20 AM, Pascal de Bruijn <p.debruijn@unilogic.nl> wrote:
> > 
> > Hi,
> > 
> > As I've illustrated in a previous thread:
> > ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag
> > We've been experiencing some kernel lockups while using LIO as an FC target.
> > 
> > We've collected the kernel trace via serial, and this is the second lockup where we've been able to get proper kernel messages. The trace is very similar to the previous one, so the issue is practically reproducible for us, though thus far not at will. This happens once or twice a week for us, always at night:
> > 
> > Detected MISCOMPARE for addr: ffff91f3c55ac000 buf: ffff91f3d08dd600
> > Target/iblock: Send MISCOMPARE check condition and sense
> > ABORT_TASK: Found referenced qla2xxx task_tag: 1203652
> > ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1203652
> > ABORT_TASK: Found referenced qla2xxx task_tag: 1203700
> > ------------[ cut here ]------------
> > kernel BUG at drivers/scsi/qla2xxx/qla_target.c:3319!
> > invalid opcode: 0000 [#1] SMP
> > Modules linked in: tcm_qla2xxx target_core_user uio target_core_pscsi target_core_file target_core_iblock iscsi_target_mod target_core_mod configfs raid456 async_raid6_recov async_memcpy async_pq async_xor xor async_tx raid6_pq md_mod ipmi_devintf ses enclosure sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp qla2xxx scsi_transport_fc mei_me mpt3sas igb mei ioatdma i2c_algo_bit dca wmi ipmi_si ipmi_msghandler
> > CPU: 13 PID: 24122 Comm: kworker/u34:0 Not tainted 4.9.30-ul5 #3
> > Hardware name: Supermicro SYS-1028U-E1CRTP+/X10DRU-i+, BIOS 2.0b 08/09/2016
> > Workqueue: tmr-iblock target_tmr_work [target_core_mod]
> > task: ffff91fbd4e42340 task.stack: ffffaba461684000
> > RIP: 0010:[<ffffffffc0625ba8>] [<ffffffffc0625ba8>] qlt_free_cmd+0x148/0x160 [qla2xxx]
> > RSP: 0018:ffffaba461687d30  EFLAGS: 00010202
> > RAX: 0000000000000000 RBX: ffff91fbcbd7cd48 RCX: ffffffffc064d692
> > RDX: 000000000000e074 RSI: ffff91fbd34c65a0 RDI: 0000000000004000
> > RBP: ffffaba461687d50 R08: ffffffffc063a7d0 R09: ffff91fbcbd7cd48
> > R10: ffff91f3d9822bc0 R11: ffff91f3ffff9f80 R12: ffff91fbce03b680
> > R13: 0000000000000246 R14: 0000000000125df4 R15: ffff91fbcb5ffac0
> > FS:  0000000000000000(0000) GS:ffff91fbffd40000(0000) knlGS:0000000000000000
> > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > CR2: 00007f9b9397ef60 CR3: 0000000366407000 CR4: 00000000001406e0
> > Stack:
> >  ffff91fb0000048b ffff91fbcbd7cd48 ffff91fbcff9c870 0000000000000246
> >  ffffaba461687d60 ffffffffc0a49f2f ffffaba461687d88 ffffffffc09bf417
> >  ffff91fbcbd7cdd0 ffff91fbcbd7cd48 ffff91eee98b0000 ffffaba461687dd8
> > Call Trace:
> >  [<ffffffffc0a49f2f>] tcm_qla2xxx_release_cmd+0xf/0x20 [tcm_qla2xxx]
> >  [<ffffffffc09bf417>] target_put_sess_cmd+0xb7/0x120 [target_core_mod]
> >  [<ffffffffc09baa8a>] core_tmr_abort_task+0x12a/0x180 [target_core_mod]
> > [<ffffffffc09bcfc1>] target_tmr_work+0x111/0x120 [target_core_mod]
> >  [<ffffffff86c8fed1>] process_one_work+0x121/0x3c0
> >  [<ffffffff86c901b3>] worker_thread+0x43/0x4d0
> >  [<ffffffff86c90170>] ? process_one_work+0x3c0/0x3c0
> >  [<ffffffff86c90170>] ? process_one_work+0x3c0/0x3c0
> >  [<ffffffff86c959a4>] kthread+0xd4/0xf0
> >  [<ffffffff86c958d0>] ? kthread_worker_fn+0x100/0x100
> >  [<ffffffff8708f922>] ret_from_fork+0x22/0x30
> > Code: 99 a6 c6 0f b6 83 8c 02 00 00 e9 3f ff ff ff be 05 0d 00 00 48 c7 c7 e8 6e 64 c0 e8 53 69 65 c6 48 83 c4 08 5b 41 5c 41 5d 5d c3 <0f> 0b 0f 0b 48 8b bb 90 02 00 00 e8 58 14 71 c6 e9 2c ff ff ff
> > RIP  [<ffffffffc0625ba8>] qlt_free_cmd+0x148/0x160 [qla2xxx]
> > RSP <ffffaba461687d30>
> > ---[ end trace 82b53ac345806386 ]---
> > 
> > Given the trace above, it would seem this is an issue in the qla2xxx driver's target mode code, where the following bit seems particularly relevant:
> > 
> > linux-4.9.30/drivers/scsi/qla2xxx/qla_target.c
> > 
> > void qlt_free_cmd(struct qla_tgt_cmd *cmd)
> > {
> >        struct qla_tgt_sess *sess = cmd->sess;
> > 
> >        ql_dbg(ql_dbg_tgt, cmd->vha, 0xe074,
> >            "%s: se_cmd[%p] ox_id %04x\n",
> >            __func__, &cmd->se_cmd,
> > be16_to_cpu(cmd->atio.u.isp24.fcp_hdr.ox_id));
> > 
> >        BUG_ON(cmd->cmd_in_wq);
> > 
> > 
> > Can anybody confirm this is indeed an issue with the qla2xxx driver?
> > Is there any more information you'd need to be able to work the issue?
> > 
> > Regards,
> > Pascal de Bruijn
> > 
> > 
> 
> We are examining stack trace that was posted. However we’ll need help of Nicholas. 
> 
> Nic, 
> 
> This stack trace is very similar to the one we had reported last year as interaction issue between target stack and qla2xxx driver
>     
> you had added patches following patches to resolve this crash. (https://www.spinics.net/lists/target-devel/msg11691.html)
>     
>     
>     * | 310d3d314be7 target: Fix race with SCF_SEND_DELAYED_TAS handling
>     * | 0f4a943168f3 target: Fix remote-port TMR ABORT + se_cmd fabric stop
>     * | ebde1ca5a908 target: Fix TAS handling for multi-session se_node_acls
>     * | a6d9bb1c9605 target: Fix LUN_RESET active TMR descriptor handling
>     * | febe562c20df target: Fix LUN_RESET active I/O handling for ACK_KREF
>     
> Looks like in this environment we are still hitting same issue. 
>     
> Does this diff makes sense to add check for TRANSPORT_COMPLETE state so that qlt_free_cmd is not called from normal command path?
>     
> diff --git a/drivers/target/target_core_transport.c b/drivers/target/target_core_transport.c
> index 7a42bfd..35cecae 100644
> --- a/drivers/target/target_core_transport.c
> +++ b/drivers/target/target_core_transport.c
> @@ -1815,7 +1815,13 @@ void __target_execute_cmd(struct se_cmd *cmd)
> ret = cmd->execute_cmd(cmd);
> if (ret) {
> spin_lock_irq(&cmd->t_state_lock);
> + cmd->t_state = TRANSPORT_COMPLETE;
> cmd->transport_state &= ~(CMD_T_BUSY|CMD_T_SENT);
> + cmd->transport_state |= CMD_T_COMPLETE;
> spin_unlock_irq(&cmd->t_state_lock);
>     
> transport_generic_request_failure(cmd, ret);
> 

After staring at this, there are two scenarios during ABORT_TASK where
this scenario could occur.

Namely, there are two cases where tcm_qla2xxx.c calls
tcm_qla2xxx_free_cmd() internally during ABORT_TASK:

  - During tcm_qla2xxx_handle_data_work() when cmd->aborted is detected.
  - During tcm_qla2xxx_aborted_task() when QLA_TGT_STATE_NEW or 
    QLA_TGT_STATE_DATA_IN are set.

My guess is the latter case is happening, that is resulting in
tcm_qla2xxx_free_cmd() queuing up qla_tgt_cmd->work from
transport_cmd_finish_abort() -> tcm_qla2xxx_aborted_task(), immediately
before target_put_sess_cmd() is called from core_tmr_abort_task() to
perform the final kref_put(&cmd->kref, ...).

So there is two things I'd like Pascal to try.

First, is to pick up one recent bug fix to address the possible
underflow for this scenario during ABORT_TASK:

target: Fix kref->refcount underflow in transport_cmd_finish_abort
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/drivers/target?id=73d4e580ccc5c3e05cea002f18111f66c9c07034

Note this is not directly related, but is still useful to have to avoid
the possible underflow of se_cmd->cmd_kref during this particular
scenario.  Pascal, let us know if you have any problems applying this
patch to v4.9.30.

The second is the following patch to catch this scenario in progress
during qlt_free_cmd(), and clear ->cmd_in-wq + dump the state before
doing a cancel_work_sync().

So I'd like Pascal to get this output before making further changes to
confirm what's really what's going on..

Beyond this patch, I'm starting to think the two cases above where
tcm_qla2xxx is calling tcm_qla2xxx_free_cmd() during TMR ABORT, are in
fact incorrect.

They are incorrect because in both cases target_core_tmr.c callers and
transport_cmd_finish_abort() are responsible for dropping any remaining
se_cmd->cmd_kref references, and releasing all se_cmd resources.

Having tcm_qla2xxx internal calls to tcm_qla2xxx_free_cmd() while TMR
CMD_T_ABORTED code has already quiesced se_cmd, and is in the process of
dropping se_cmd->cmd_kref references doesn't make sense to me.

That said, let's have Pascal verify with the two above patches, and if
we confirm it's scenario in question, then I'd like to consider dropping
the two cases in tcm_qla2xxx that invoke tcm_qla2xxx_free_cmd() during
CMD_T_ABORTED.

--
To unsubscribe from this list: send the line "unsubscribe target-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

kernel BUG at drivers/scsi/qla2xxx/qla_target.c:3319!

Commit Message

Comments

Patch