
kernel BUG at drivers/scsi/qla2xxx/qla_target.c:3319!

Message ID 1499745137.28145.15.camel@haakon3.daterainc.com (mailing list archive)
State New, archived

Commit Message

Nicholas A. Bellinger July 11, 2017, 3:52 a.m. UTC
On Mon, 2017-07-10 at 20:41 -0700, Nicholas A. Bellinger wrote:
> On Mon, 2017-07-10 at 17:04 +0200, Pascal de Bruijn wrote:
> > On 07-07-17 06:26, Nicholas A. Bellinger wrote:
> > > On Mon, 2017-07-03 at 16:03 +0200, Pascal de Bruijn wrote:
> > >> So abort_task can still be observed, but they do not result in a
> > >> non-functional not-quite-PANICked machine anymore.
> > > Thanks a lot for the bug report and your continuous testing to get this
> > > resolved.  The patch is queued up in target-pending/for-next with your
> > > Tested-by, and will be CC'd to stable so the older v4.x.y and v3.x.y
> > > kernels get this fix as well.
> > >
> > > Thanks again.
> > 
> > I'm afraid we may not be quite there yet after all...
> > 
> > So we had the other two machines run an md check this weekend
> > as well, again with a ridiculously high sync_speed_max:
> > 
> > Jul  9 04:00:01 myhost kernel: [661309.794774] md: data-check of RAID array md0
> > Jul  9 04:00:01 myhost kernel: [661309.799173] md: minimum _guaranteed_  speed: 10000 KB/sec/disk.
> > Jul  9 04:00:01 myhost kernel: [661309.805219] md: using maximum available idle IO bandwidth (but not more than 1000000 KB/sec) for data-check.
> > Jul  9 04:00:01 myhost kernel: [661309.815194] md: using 128k window, over a total of 3252682752k.
> > Jul  9 04:00:42 myhost kernel: [661351.076391] qla2xxx/21:00:00:24:ff:4b:8f:58: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION.
> > Jul  9 04:02:01 myhost kernel: [661429.985082] qla2xxx/21:00:00:24:ff:4b:9e:19: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION.
> > Jul  9 04:04:24 myhost kernel: [661573.395245] qla2xxx/50:01:43:80:28:ca:86:36: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION.
> > Jul  9 04:04:57 myhost kernel: [661605.837694] qla2xxx/50:01:43:80:28:ca:86:e6: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION.
> > Jul  9 04:09:19 myhost kernel: [661868.261211] qla2xxx/21:00:00:24:ff:54:9e:ab: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION.
> > Jul  9 04:13:17 myhost kernel: [662105.788459] ABORT_TASK: Found referenced qla2xxx task_tag: 1175332
> > Jul  9 04:13:17 myhost kernel: [662105.794794] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1175332
> > Jul  9 04:13:17 myhost kernel: [662105.990584] ABORT_TASK: Found referenced qla2xxx task_tag: 1175380
> > Jul  9 04:13:18 myhost kernel: [662106.510403] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 1175380
> > Jul  9 04:13:20 myhost kernel: [662108.988526] ABORT_TASK: Found referenced qla2xxx task_tag: 1175620
> > Jul  9 04:13:31 myhost kernel: [662119.684969] ABORT_TASK: Found referenced qla2xxx task_tag: 1211140
> 
> This means the outstanding I/O was located, but never completed back.
> 
> Or, something else is blocked waiting for a completion, such as an
> outstanding qla2xxx WRITE transfer, before the abort may proceed.
> 
> > Jul  9 04:16:42 myhost kernel: [662310.617910] qla2xxx/21:00:00:24:ff:92:bf:43: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION.
> > Jul  9 04:18:00 myhost kernel: [662389.415853] qla2xxx/21:00:00:24:ff:54:a1:33: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION.
> > Jul  9 04:18:22 myhost kernel: [662411.066461] qla2xxx/21:00:00:24:ff:92:bf:59: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION.
> > Jul  9 04:20:23 myhost kernel: [662531.852833] qla2xxx/21:00:00:24:ff:3c:d0:94: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION.
> > Jul  9 07:00:28 myhost kernel: [672137.325166] md: md0: data-check done.
> > 
> 
> Btw, were there any hung task warnings while this was happening..?
> 
> > The machine in question was still responsive (was accepting
> > SSH logins), however it seemed VMware hosts weren't seeing
> > the volume anymore (presumably due to the heavy IO on the backend).
> > 
> > Also, several (12?) kworkers seemed stuck in a D state.
> > 
> 
> Next time that happens, do a 'cat /proc/$PID/stack' to see where those
> kworkers are blocked in un-interruptible sleep.
> 
> > When my colleague tried to reboot the machine it got
> > (presumably) stuck on
> > 
> > /usr/bin/targetctl clear
> > 
> > After which it was forcefully rebooted :)
> > 
> > Sorry we don't have any more detailed info at this point.
> > We haven't been able to reproduce this on a
> > different machine yet :(
> 
> Ok, please try with the following debug patch to verify if it's blocked
> on qla_tgt_cmd->write_pending_abort_comp.
> 
> diff --git a/drivers/scsi/qla2xxx/tcm_qla2xxx.c b/drivers/scsi/qla2xxx/tcm_qla2xxx.c
> index 8c1bf9b..1199969 100644
> --- a/drivers/scsi/qla2xxx/tcm_qla2xxx.c
> +++ b/drivers/scsi/qla2xxx/tcm_qla2xxx.c
> @@ -425,6 +425,7 @@ static int tcm_qla2xxx_write_pending_status(struct se_cmd *se_cmd)
>  	if (se_cmd->t_state == TRANSPORT_WRITE_PENDING ||
>  	    se_cmd->t_state == TRANSPORT_COMPLETE_QF_WP) {
>  		spin_unlock_irqrestore(&se_cmd->t_state_lock, flags);
> +		printk("Completing write_pending_abort_comp for tag: %llu\n", se_cmd->tag);
>  		wait_for_completion(&cmd->write_pending_abort_comp);
>  		return 0;
>  	}
> @@ -515,7 +516,11 @@ static void tcm_qla2xxx_handle_data_work(struct work_struct *work)
>  		 * waiting upon completion in tcm_qla2xxx_write_pending_status()
>  		 */
>  		if (cmd->se_cmd.transport_state & CMD_T_ABORTED) {
> +			u64 tag = cmd->se_cmd.tag;
> +
> +			printk("Blocked on write_pending_abort_comp for tag: %llu\n", tag);
>  			complete(&cmd->write_pending_abort_comp);
> +			printk("Awoke write_pending_abort_comp for tag: %llu\n", tag);
>  			return;
>  		}
>  

Whoops, mixed up the printk() output above..

Please use this one instead:

diff --git a/drivers/scsi/qla2xxx/tcm_qla2xxx.c b/drivers/scsi/qla2xxx/tcm_qla2xxx.c
index 8c1bf9b..01ffb8a 100644
--- a/drivers/scsi/qla2xxx/tcm_qla2xxx.c
+++ b/drivers/scsi/qla2xxx/tcm_qla2xxx.c
@@ -417,6 +417,8 @@ static int tcm_qla2xxx_write_pending_status(struct se_cmd *se_cmd)
 {
 	struct qla_tgt_cmd *cmd = container_of(se_cmd, struct qla_tgt_cmd, se_cmd);
 	unsigned long flags;
+	u64 tag = se_cmd->tag;
+
 	/*
 	 * Check for WRITE_PENDING status to determine if we need to wait for
 	 * CTIO aborts to be posted via hardware in tcm_qla2xxx_handle_data().
@@ -425,7 +427,9 @@ static int tcm_qla2xxx_write_pending_status(struct se_cmd *se_cmd)
 	if (se_cmd->t_state == TRANSPORT_WRITE_PENDING ||
 	    se_cmd->t_state == TRANSPORT_COMPLETE_QF_WP) {
 		spin_unlock_irqrestore(&se_cmd->t_state_lock, flags);
+		printk("Blocked on write_pending_abort_comp for tag: %llu\n", tag);
 		wait_for_completion(&cmd->write_pending_abort_comp);
+		printk("Awoke write_pending_abort_comp for tag: %llu\n", tag);
 		return 0;
 	}
 	spin_unlock_irqrestore(&se_cmd->t_state_lock, flags);
@@ -508,6 +512,11 @@ static void tcm_qla2xxx_handle_data_work(struct work_struct *work)
 	 */
 	cmd->cmd_in_wq = 0;
 
+	if (cmd->se_cmd.transport_state & CMD_T_ABORTED) {
+		printk("Detected CMD_T_ABORTED from handle_data_work tag: %llu %u\n",
+			cmd->se_cmd.tag, cmd->write_data_transferred);
+	}
+
 	cmd->vha->tgt_counters.qla_core_ret_ctio++;
 	if (!cmd->write_data_transferred) {
 		/*
 		 * Check if se_cmd has already been aborted via LUN_RESET, and
 		 * waiting upon completion in tcm_qla2xxx_write_pending_status()
 		 */
 		if (cmd->se_cmd.transport_state & CMD_T_ABORTED) {
+			printk("Completing write_pending_abort_comp for tag: %llu\n",
+				cmd->se_cmd.tag);
 			complete(&cmd->write_pending_abort_comp);
 			return;
 		}

Comments

Nicholas A. Bellinger July 11, 2017, 5:52 p.m. UTC | #1
Hi Pascal,

On Mon, 2017-07-10 at 20:52 -0700, Nicholas A. Bellinger wrote:
> On Mon, 2017-07-10 at 20:41 -0700, Nicholas A. Bellinger wrote:
> > On Mon, 2017-07-10 at 17:04 +0200, Pascal de Bruijn wrote:
> > > On 07-07-17 06:26, Nicholas A. Bellinger wrote:
> > > > On Mon, 2017-07-03 at 16:03 +0200, Pascal de Bruijn wrote:
> > > >> So abort_task can still be observed, but they do not result in a
> > > >> non-functional not-quite-PANICked machine anymore.
> > > > Thanks a lot for the bug report and your continuous testing to get this
> > > > resolved.  The patch is queued up in target-pending/for-next with your
> > > > Tested-by, and will be CC'd to stable so the older v4.x.y and v3.x.y
> > > > kernels get this fix as well.
> > > >
> > > > Thanks again.
> > > 
> > > I'm afraid we may not be quite there yet after all...
> > > 
> > > So we had the other two machines run an md check this weekend
> > > as well, again with a ridiculously high sync_speed_max:
> > > 
> > > Jul  9 04:00:01 myhost kernel: [661309.794774] md: data-check of RAID array md0
> > > Jul  9 04:00:01 myhost kernel: [661309.799173] md: minimum _guaranteed_  speed: 10000 KB/sec/disk.
> > > Jul  9 04:00:01 myhost kernel: [661309.805219] md: using maximum available idle IO bandwidth (but not more than 1000000 KB/sec) for data-check.
> > > Jul  9 04:00:01 myhost kernel: [661309.815194] md: using 128k window, over a total of 3252682752k.
> > > Jul  9 04:00:42 myhost kernel: [661351.076391] qla2xxx/21:00:00:24:ff:4b:8f:58: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION.
> > > Jul  9 04:02:01 myhost kernel: [661429.985082] qla2xxx/21:00:00:24:ff:4b:9e:19: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION.
> > > Jul  9 04:04:24 myhost kernel: [661573.395245] qla2xxx/50:01:43:80:28:ca:86:36: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION.
> > > Jul  9 04:04:57 myhost kernel: [661605.837694] qla2xxx/50:01:43:80:28:ca:86:e6: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION.
> > > Jul  9 04:09:19 myhost kernel: [661868.261211] qla2xxx/21:00:00:24:ff:54:9e:ab: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION.
> > > Jul  9 04:13:17 myhost kernel: [662105.788459] ABORT_TASK: Found referenced qla2xxx task_tag: 1175332
> > > Jul  9 04:13:17 myhost kernel: [662105.794794] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1175332
> > > Jul  9 04:13:17 myhost kernel: [662105.990584] ABORT_TASK: Found referenced qla2xxx task_tag: 1175380
> > > Jul  9 04:13:18 myhost kernel: [662106.510403] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 1175380
> > > Jul  9 04:13:20 myhost kernel: [662108.988526] ABORT_TASK: Found referenced qla2xxx task_tag: 1175620
> > > Jul  9 04:13:31 myhost kernel: [662119.684969] ABORT_TASK: Found referenced qla2xxx task_tag: 1211140
> > 
> > This means the outstanding I/O was located, but never completed back.
> > 
> > Or, something else is blocked waiting for a completion, such as an
> > outstanding qla2xxx WRITE transfer, before the abort may proceed.
> > 
> > > Jul  9 04:16:42 myhost kernel: [662310.617910] qla2xxx/21:00:00:24:ff:92:bf:43: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION.
> > > Jul  9 04:18:00 myhost kernel: [662389.415853] qla2xxx/21:00:00:24:ff:54:a1:33: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION.
> > > Jul  9 04:18:22 myhost kernel: [662411.066461] qla2xxx/21:00:00:24:ff:92:bf:59: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION.
> > > Jul  9 04:20:23 myhost kernel: [662531.852833] qla2xxx/21:00:00:24:ff:3c:d0:94: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION.
> > > Jul  9 07:00:28 myhost kernel: [672137.325166] md: md0: data-check done.
> > > 
> > 
> > Btw, were there any hung task warnings while this was happening..?
> > 
> > > The machine in question was still responsive (was accepting
> > > SSH logins), however it seemed VMware hosts weren't seeing
> > > the volume anymore (presumably due to the heavy IO on the backend).
> > > 
> > > Also, several (12?) kworkers seemed stuck in a D state.
> > > 
> > 
> > Next time that happens, do a 'cat /proc/$PID/stack' to see where those
> > kworkers are blocked in un-interruptible sleep.
> > 
> > > When my colleague tried to reboot the machine it got
> > > (presumably) stuck on
> > > 
> > > /usr/bin/targetctl clear
> > > 
> > > After which it was forcefully rebooted :)
> > > 
> > > Sorry we don't have any more detailed info at this point.
> > > We haven't been able to reproduce this on a
> > > different machine yet :(
> > 
> > Ok, please try with the following debug patch to verify if it's blocked
> > on qla_tgt_cmd->write_pending_abort_comp.
> > 
> > diff --git a/drivers/scsi/qla2xxx/tcm_qla2xxx.c b/drivers/scsi/qla2xxx/tcm_qla2xxx.c
> > index 8c1bf9b..1199969 100644
> > --- a/drivers/scsi/qla2xxx/tcm_qla2xxx.c
> > +++ b/drivers/scsi/qla2xxx/tcm_qla2xxx.c
> > @@ -425,6 +425,7 @@ static int tcm_qla2xxx_write_pending_status(struct se_cmd *se_cmd)
> >  	if (se_cmd->t_state == TRANSPORT_WRITE_PENDING ||
> >  	    se_cmd->t_state == TRANSPORT_COMPLETE_QF_WP) {
> >  		spin_unlock_irqrestore(&se_cmd->t_state_lock, flags);
> > +		printk("Completing write_pending_abort_comp for tag: %llu\n", se_cmd->tag);
> >  		wait_for_completion(&cmd->write_pending_abort_comp);
> >  		return 0;
> >  	}
> > @@ -515,7 +516,11 @@ static void tcm_qla2xxx_handle_data_work(struct work_struct *work)
> >  		 * waiting upon completion in tcm_qla2xxx_write_pending_status()
> >  		 */
> >  		if (cmd->se_cmd.transport_state & CMD_T_ABORTED) {
> > +			u64 tag = cmd->se_cmd.tag;
> > +
> > +			printk("Blocked on write_pending_abort_comp for tag: %llu\n", tag);
> >  			complete(&cmd->write_pending_abort_comp);
> > +			printk("Awoke write_pending_abort_comp for tag: %llu\n", tag);
> >  			return;
> >  		}
> >  
> 
> Whoops, mixed up the printk() output above..
> 
> Please use this one instead:
> 
> diff --git a/drivers/scsi/qla2xxx/tcm_qla2xxx.c b/drivers/scsi/qla2xxx/tcm_qla2xxx.c
> index 8c1bf9b..01ffb8a 100644
> --- a/drivers/scsi/qla2xxx/tcm_qla2xxx.c
> +++ b/drivers/scsi/qla2xxx/tcm_qla2xxx.c
> @@ -417,6 +417,8 @@ static int tcm_qla2xxx_write_pending_status(struct se_cmd *se_cmd)
>  {
>  	struct qla_tgt_cmd *cmd = container_of(se_cmd, struct qla_tgt_cmd, se_cmd);
>  	unsigned long flags;
> +	u64 tag = se_cmd->tag;
> +
>  	/*
>  	 * Check for WRITE_PENDING status to determine if we need to wait for
>  	 * CTIO aborts to be posted via hardware in tcm_qla2xxx_handle_data().
> @@ -425,7 +427,9 @@ static int tcm_qla2xxx_write_pending_status(struct se_cmd *se_cmd)
>  	if (se_cmd->t_state == TRANSPORT_WRITE_PENDING ||
>  	    se_cmd->t_state == TRANSPORT_COMPLETE_QF_WP) {
>  		spin_unlock_irqrestore(&se_cmd->t_state_lock, flags);
> +		printk("Blocked on write_pending_abort_comp for tag: %llu\n", tag);
>  		wait_for_completion(&cmd->write_pending_abort_comp);
> +		printk("Awoke write_pending_abort_comp for tag: %llu\n", tag);
>  		return 0;
>  	}
>  	spin_unlock_irqrestore(&se_cmd->t_state_lock, flags);
> @@ -508,6 +512,11 @@ static void tcm_qla2xxx_handle_data_work(struct work_struct *work)
>  	 */
>  	cmd->cmd_in_wq = 0;
>  
> +	if (cmd->se_cmd.transport_state & CMD_T_ABORTED) {
> +		printk("Detected CMD_T_ABORTED from handle_data_work tag: %llu %u\n",
> +			cmd->se_cmd.tag, cmd->write_data_transferred);
> +	}
> +
>  	cmd->vha->tgt_counters.qla_core_ret_ctio++;
>  	if (!cmd->write_data_transferred) {
>  		/*
> @@ -515,6 +524,8 @@ static void tcm_qla2xxx_handle_data_work(struct work_struct *work)
>  		 * waiting upon completion in tcm_qla2xxx_write_pending_status()
>  		 */
>  		if (cmd->se_cmd.transport_state & CMD_T_ABORTED) {
> +			printk("Completing write_pending_abort_comp for tag: %llu\n",
> +				cmd->se_cmd.tag);
>  			complete(&cmd->write_pending_abort_comp);
>  			return;
>  		}
> 

Btw, since you're still observing issues with the original change, it
has been reverted in target-pending/for-next for now until we get the
second issue sorted out.

I'll plan to merge an updated version post v4.13-rc1 once we get a
handle on what's going on.

Thanks again.

Pascal de Bruijn | Unilogic B.V. July 21, 2017, 1:07 p.m. UTC | #2
On 11-07-17 19:52, Nicholas A. Bellinger wrote:
> Hi Pascal,
>
> On Mon, 2017-07-10 at 20:52 -0700, Nicholas A. Bellinger wrote:
>> On Mon, 2017-07-10 at 20:41 -0700, Nicholas A. Bellinger wrote:
>>> On Mon, 2017-07-10 at 17:04 +0200, Pascal de Bruijn wrote:
>>>> On 07-07-17 06:26, Nicholas A. Bellinger wrote:
>>>>> On Mon, 2017-07-03 at 16:03 +0200, Pascal de Bruijn wrote:
>>>>>> So abort_task can still be observed, but they do not result in a
>>>>>> non-functional not-quite-PANICked machine anymore.
>>>>> Thanks a lot for the bug report and your continuous testing to get this
>>>>> resolved.  The patch is queued up in target-pending/for-next with your
>>>>> Tested-by, and will be CC'd to stable so the older v4.x.y and v3.x.y
>>>>> kernels get this fix as well.
>>>>>
>>>>> Thanks again.
>>>> I'm afraid we may not be quite there yet after all...
>>>>
>>>> So we had the other two machines run an md check this weekend
>>>> as well, again with a ridiculously high sync_speed_max:
>>>>
>>>> Jul  9 04:00:01 myhost kernel: [661309.794774] md: data-check of RAID array md0
>>>> Jul  9 04:00:01 myhost kernel: [661309.799173] md: minimum _guaranteed_  speed: 10000 KB/sec/disk.
>>>> Jul  9 04:00:01 myhost kernel: [661309.805219] md: using maximum available idle IO bandwidth (but not more than 1000000 KB/sec) for data-check.
>>>> Jul  9 04:00:01 myhost kernel: [661309.815194] md: using 128k window, over a total of 3252682752k.
>>>> Jul  9 04:00:42 myhost kernel: [661351.076391] qla2xxx/21:00:00:24:ff:4b:8f:58: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION.
>>>> Jul  9 04:02:01 myhost kernel: [661429.985082] qla2xxx/21:00:00:24:ff:4b:9e:19: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION.
>>>> Jul  9 04:04:24 myhost kernel: [661573.395245] qla2xxx/50:01:43:80:28:ca:86:36: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION.
>>>> Jul  9 04:04:57 myhost kernel: [661605.837694] qla2xxx/50:01:43:80:28:ca:86:e6: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION.
>>>> Jul  9 04:09:19 myhost kernel: [661868.261211] qla2xxx/21:00:00:24:ff:54:9e:ab: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION.
>>>> Jul  9 04:13:17 myhost kernel: [662105.788459] ABORT_TASK: Found referenced qla2xxx task_tag: 1175332
>>>> Jul  9 04:13:17 myhost kernel: [662105.794794] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1175332
>>>> Jul  9 04:13:17 myhost kernel: [662105.990584] ABORT_TASK: Found referenced qla2xxx task_tag: 1175380
>>>> Jul  9 04:13:18 myhost kernel: [662106.510403] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 1175380
>>>> Jul  9 04:13:20 myhost kernel: [662108.988526] ABORT_TASK: Found referenced qla2xxx task_tag: 1175620
>>>> Jul  9 04:13:31 myhost kernel: [662119.684969] ABORT_TASK: Found referenced qla2xxx task_tag: 1211140
>>> This means the outstanding I/O was located, but never completed back.
>>>
>>> Or, something else is blocked waiting for a completion, such as an
>>> outstanding qla2xxx WRITE transfer, before the abort may proceed.
>>>
>>>> Jul  9 04:16:42 myhost kernel: [662310.617910] qla2xxx/21:00:00:24:ff:92:bf:43: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION.
>>>> Jul  9 04:18:00 myhost kernel: [662389.415853] qla2xxx/21:00:00:24:ff:54:a1:33: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION.
>>>> Jul  9 04:18:22 myhost kernel: [662411.066461] qla2xxx/21:00:00:24:ff:92:bf:59: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION.
>>>> Jul  9 04:20:23 myhost kernel: [662531.852833] qla2xxx/21:00:00:24:ff:3c:d0:94: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION.
>>>> Jul  9 07:00:28 myhost kernel: [672137.325166] md: md0: data-check done.
>>>>
>>> Btw, were there any hung task warnings while this was happening..?
>>>
>>>> The machine in question was still responsive (was accepting
>>>> SSH logins), however it seemed VMware hosts weren't seeing
>>>> the volume anymore (presumably due to the heavy IO on the backend).
>>>>
>>>> Also, several (12?) kworkers seemed stuck in a D state.
>>>>
>>> Next time that happens, do a 'cat /proc/$PID/stack' to see where those
>>> kworkers are blocked in un-interruptible sleep.
>>>
>>>> When my colleague tried to reboot the machine it got
>>>> (presumably) stuck on
>>>>
>>>> /usr/bin/targetctl clear
>>>>
>>>> After which it was forcefully rebooted :)
>>>>
>>>> Sorry we don't have any more detailed info at this point.
>>>> We haven't been able to reproduce this on a
>>>> different machine yet :(
>>> Ok, please try with the following debug patch to verify if it's blocked
>>> on qla_tgt_cmd->write_pending_abort_comp.
>>>
>>> diff --git a/drivers/scsi/qla2xxx/tcm_qla2xxx.c b/drivers/scsi/qla2xxx/tcm_qla2xxx.c
>>> index 8c1bf9b..1199969 100644
>>> --- a/drivers/scsi/qla2xxx/tcm_qla2xxx.c
>>> +++ b/drivers/scsi/qla2xxx/tcm_qla2xxx.c
>>> @@ -425,6 +425,7 @@ static int tcm_qla2xxx_write_pending_status(struct se_cmd *se_cmd)
>>>   	if (se_cmd->t_state == TRANSPORT_WRITE_PENDING ||
>>>   	    se_cmd->t_state == TRANSPORT_COMPLETE_QF_WP) {
>>>   		spin_unlock_irqrestore(&se_cmd->t_state_lock, flags);
>>> +		printk("Completing write_pending_abort_comp for tag: %llu\n", se_cmd->tag);
>>>   		wait_for_completion(&cmd->write_pending_abort_comp);
>>>   		return 0;
>>>   	}
>>> @@ -515,7 +516,11 @@ static void tcm_qla2xxx_handle_data_work(struct work_struct *work)
>>>   		 * waiting upon completion in tcm_qla2xxx_write_pending_status()
>>>   		 */
>>>   		if (cmd->se_cmd.transport_state & CMD_T_ABORTED) {
>>> +			u64 tag = cmd->se_cmd.tag;
>>> +
>>> +			printk("Blocked on write_pending_abort_comp for tag: %llu\n", tag);
>>>   			complete(&cmd->write_pending_abort_comp);
>>> +			printk("Awoke write_pending_abort_comp for tag: %llu\n", tag);
>>>   			return;
>>>   		}
>>>   
>> Whoops, mixed up the printk() output above..
>>
>> Please use this one instead:
>>
>> diff --git a/drivers/scsi/qla2xxx/tcm_qla2xxx.c b/drivers/scsi/qla2xxx/tcm_qla2xxx.c
>> index 8c1bf9b..01ffb8a 100644
>> --- a/drivers/scsi/qla2xxx/tcm_qla2xxx.c
>> +++ b/drivers/scsi/qla2xxx/tcm_qla2xxx.c
>> @@ -417,6 +417,8 @@ static int tcm_qla2xxx_write_pending_status(struct se_cmd *se_cmd)
>>   {
>>   	struct qla_tgt_cmd *cmd = container_of(se_cmd, struct qla_tgt_cmd, se_cmd);
>>   	unsigned long flags;
>> +	u64 tag = se_cmd->tag;
>> +
>>   	/*
>>   	 * Check for WRITE_PENDING status to determine if we need to wait for
>>   	 * CTIO aborts to be posted via hardware in tcm_qla2xxx_handle_data().
>> @@ -425,7 +427,9 @@ static int tcm_qla2xxx_write_pending_status(struct se_cmd *se_cmd)
>>   	if (se_cmd->t_state == TRANSPORT_WRITE_PENDING ||
>>   	    se_cmd->t_state == TRANSPORT_COMPLETE_QF_WP) {
>>   		spin_unlock_irqrestore(&se_cmd->t_state_lock, flags);
>> +		printk("Blocked on write_pending_abort_comp for tag: %llu\n", tag);
>>   		wait_for_completion(&cmd->write_pending_abort_comp);
>> +		printk("Awoke write_pending_abort_comp for tag: %llu\n", tag);
>>   		return 0;
>>   	}
>>   	spin_unlock_irqrestore(&se_cmd->t_state_lock, flags);
>> @@ -508,6 +512,11 @@ static void tcm_qla2xxx_handle_data_work(struct work_struct *work)
>>   	 */
>>   	cmd->cmd_in_wq = 0;
>>   
>> +	if (cmd->se_cmd.transport_state & CMD_T_ABORTED) {
>> +		printk("Detected CMD_T_ABORTED from handle_data_work tag: %llu %u\n",
>> +			cmd->se_cmd.tag, cmd->write_data_transferred);
>> +	}
>> +
>>   	cmd->vha->tgt_counters.qla_core_ret_ctio++;
>>   	if (!cmd->write_data_transferred) {
>>   		/*
>> @@ -515,6 +524,8 @@ static void tcm_qla2xxx_handle_data_work(struct work_struct *work)
>>   		 * waiting upon completion in tcm_qla2xxx_write_pending_status()
>>   		 */
>>   		if (cmd->se_cmd.transport_state & CMD_T_ABORTED) {
>> +			printk("Completing write_pending_abort_comp for tag: %llu\n",
>> +				cmd->se_cmd.tag);
>>   			complete(&cmd->write_pending_abort_comp);
>>   			return;
>>   		}

We still haven't been able to reproduce this reliably; however, we did
come across the following, which is very likely related.

Essentially we're running 'echo check > /sys/block/md0/md/sync_action' 
throughout the night, with no practical sync_speed_max limit.
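
In shell terms it amounts to something like this (a sketch, not our exact
script; the uncapped speed value and the polling interval are illustrative):

  # uncap the resync speed, then run back-to-back checks overnight
  echo 1000000 > /sys/block/md0/md/sync_speed_max
  while true; do
      echo check > /sys/block/md0/md/sync_action
      # wait for md to go idle again before kicking off the next check
      while [ "$(cat /sys/block/md0/md/sync_action)" != "idle" ]; do
          sleep 60
      done
  done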

Only in the morning, presumably while I was doing something in VMware
(I don't particularly recall what), did VMware start reporting the
volume as inaccessible.

Here's the full dmesg, to provide maximum context (sorry for the long post):

[Thu Jul 20 18:01:51 2017] Rounding down aligned max_sectors from 4294967295 to 4294967288
[Thu Jul 20 18:01:51 2017] Rounding down aligned max_sectors from 4294967295 to 4294967288
[Thu Jul 20 18:01:51 2017] ignoring deprecated emulate_dpo attribute
[Thu Jul 20 18:01:51 2017] ignoring deprecated emulate_fua_read attribute
[Thu Jul 20 18:01:51 2017] qla2xxx [0000:81:00.1]-00af:13: Performing ISP error recovery - ha=ffff9ebcd33c0000.
[Thu Jul 20 18:01:53 2017] qla2xxx [0000:81:00.1]-500a:13: LOOP UP detected (4 Gbps).
[Thu Jul 20 18:01:53 2017] qla2xxx [0000:82:00.0]-00af:14: Performing ISP error recovery - ha=ffff9ebcd3180000.
[Thu Jul 20 18:01:55 2017] qla2xxx [0000:82:00.0]-500a:14: LOOP UP detected (4 Gbps).
[Thu Jul 20 18:01:56 2017] qla2xxx [0000:82:00.1]-00af:15: Performing ISP error recovery - ha=ffff9ebcd3400000.
[Thu Jul 20 18:01:57 2017] qla2xxx [0000:82:00.1]-500a:15: LOOP UP detected (4 Gbps).
[Thu Jul 20 18:01:58 2017] qla2xxx [0000:81:00.0]-00af:2: Performing ISP error recovery - ha=ffff9ebcd4190000.
[Thu Jul 20 18:01:59 2017] qla2xxx [0000:81:00.0]-500a:2: LOOP UP detected (4 Gbps).
[Thu Jul 20 18:04:42 2017] md: data-check of RAID array md0
[Thu Jul 20 18:04:42 2017] md: minimum _guaranteed_  speed: 15000 KB/sec/disk.
[Thu Jul 20 18:04:42 2017] md: using maximum available idle IO bandwidth (but not more than 1000000 KB/sec) for data-check.
[Thu Jul 20 18:04:42 2017] md: using 128k window, over a total of 405798912k.
[Thu Jul 20 18:40:14 2017] md: md0: data-check done.
[Thu Jul 20 18:56:49 2017] md: data-check of RAID array md0
[Thu Jul 20 18:56:49 2017] md: minimum _guaranteed_  speed: 15000 KB/sec/disk.
[Thu Jul 20 18:56:49 2017] md: using maximum available idle IO bandwidth (but not more than 1000000 KB/sec) for data-check.
[Thu Jul 20 18:56:49 2017] md: using 128k window, over a total of 405798912k.
[Thu Jul 20 19:34:16 2017] md: md0: data-check done.
[Thu Jul 20 20:03:49 2017] md: data-check of RAID array md0
[Thu Jul 20 20:03:49 2017] md: minimum _guaranteed_  speed: 15000 KB/sec/disk.
[Thu Jul 20 20:03:49 2017] md: using maximum available idle IO bandwidth (but not more than 1000000 KB/sec) for data-check.
[Thu Jul 20 20:03:49 2017] md: using 128k window, over a total of 405798912k.
[Thu Jul 20 20:04:51 2017] ABORT_TASK: Found referenced qla2xxx task_tag: 1223620
[Thu Jul 20 20:04:52 2017] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 1223620
[Thu Jul 20 20:08:00 2017] ABORT_TASK: Found referenced qla2xxx task_tag: 1237348
[Thu Jul 20 20:08:01 2017] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 1237348
[Thu Jul 20 20:40:04 2017] md: md0: data-check done.
[Thu Jul 20 21:10:49 2017] md: data-check of RAID array md0
[Thu Jul 20 21:10:49 2017] md: minimum _guaranteed_  speed: 15000 KB/sec/disk.
[Thu Jul 20 21:10:49 2017] md: using maximum available idle IO bandwidth (but not more than 1000000 KB/sec) for data-check.
[Thu Jul 20 21:10:49 2017] md: using 128k window, over a total of 405798912k.
[Thu Jul 20 21:51:29 2017] md: md0: data-check done.
[Thu Jul 20 22:17:49 2017] md: data-check of RAID array md0
[Thu Jul 20 22:17:49 2017] md: minimum _guaranteed_  speed: 15000 KB/sec/disk.
[Thu Jul 20 22:17:49 2017] md: using maximum available idle IO bandwidth (but not more than 1000000 KB/sec) for data-check.
[Thu Jul 20 22:17:49 2017] md: using 128k window, over a total of 405798912k.
[Thu Jul 20 22:53:37 2017] md: md0: data-check done.
[Thu Jul 20 23:24:50 2017] md: data-check of RAID array md0
[Thu Jul 20 23:24:50 2017] md: minimum _guaranteed_  speed: 15000 KB/sec/disk.
[Thu Jul 20 23:24:50 2017] md: using maximum available idle IO bandwidth (but not more than 1000000 KB/sec) for data-check.
[Thu Jul 20 23:24:50 2017] md: using 128k window, over a total of 405798912k.
[Fri Jul 21 00:04:39 2017] md: md0: data-check done.
[Fri Jul 21 00:31:50 2017] md: data-check of RAID array md0
[Fri Jul 21 00:31:50 2017] md: minimum _guaranteed_  speed: 15000 KB/sec/disk.
[Fri Jul 21 00:31:50 2017] md: using maximum available idle IO bandwidth (but not more than 1000000 KB/sec) for data-check.
[Fri Jul 21 00:31:50 2017] md: using 128k window, over a total of 405798912k.
[Fri Jul 21 01:09:07 2017] md: md0: data-check done.
[Fri Jul 21 01:38:50 2017] md: data-check of RAID array md0
[Fri Jul 21 01:38:50 2017] md: minimum _guaranteed_  speed: 15000 KB/sec/disk.
[Fri Jul 21 01:38:50 2017] md: using maximum available idle IO bandwidth (but not more than 1000000 KB/sec) for data-check.
[Fri Jul 21 01:38:50 2017] md: using 128k window, over a total of 405798912k.
[Fri Jul 21 01:39:11 2017] ABORT_TASK: Found referenced qla2xxx task_tag: 1161652
[Fri Jul 21 01:39:12 2017] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 1161652
[Fri Jul 21 02:05:54 2017] ABORT_TASK: Found referenced qla2xxx task_tag: 1213252
[Fri Jul 21 02:05:55 2017] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 1213252
[Fri Jul 21 02:20:30 2017] md: md0: data-check done.
[Fri Jul 21 02:45:50 2017] md: data-check of RAID array md0
[Fri Jul 21 02:45:50 2017] md: minimum _guaranteed_  speed: 15000 KB/sec/disk.
[Fri Jul 21 02:45:50 2017] md: using maximum available idle IO bandwidth (but not more than 1000000 KB/sec) for data-check.
[Fri Jul 21 02:45:50 2017] md: using 128k window, over a total of 405798912k.
[Fri Jul 21 03:08:14 2017] ABORT_TASK: Found referenced qla2xxx task_tag: 1243780
[Fri Jul 21 03:08:15 2017] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 1243780
[Fri Jul 21 03:28:21 2017] md: md0: data-check done.
[Fri Jul 21 03:52:50 2017] md: data-check of RAID array md0
[Fri Jul 21 03:52:50 2017] md: minimum _guaranteed_  speed: 15000 KB/sec/disk.
[Fri Jul 21 03:52:50 2017] md: using maximum available idle IO bandwidth (but not more than 1000000 KB/sec) for data-check.
[Fri Jul 21 03:52:50 2017] md: using 128k window, over a total of 405798912k.
[Fri Jul 21 03:58:34 2017] ABORT_TASK: Found referenced qla2xxx task_tag: 1162612
[Fri Jul 21 03:58:34 2017] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 1162612
[Fri Jul 21 04:33:10 2017] md: md0: data-check done.
[Fri Jul 21 04:59:50 2017] md: data-check of RAID array md0
[Fri Jul 21 04:59:50 2017] md: minimum _guaranteed_  speed: 15000 KB/sec/disk.
[Fri Jul 21 04:59:50 2017] md: using maximum available idle IO bandwidth (but not more than 1000000 KB/sec) for data-check.
[Fri Jul 21 04:59:50 2017] md: using 128k window, over a total of 405798912k.
[Fri Jul 21 05:40:34 2017] md: md0: data-check done.
[Fri Jul 21 06:06:50 2017] md: data-check of RAID array md0
[Fri Jul 21 06:06:50 2017] md: minimum _guaranteed_  speed: 15000 KB/sec/disk.
[Fri Jul 21 06:06:50 2017] md: using maximum available idle IO bandwidth (but not more than 1000000 KB/sec) for data-check.
[Fri Jul 21 06:06:50 2017] md: using 128k window, over a total of 405798912k.
[Fri Jul 21 06:46:58 2017] md: md0: data-check done.
[Fri Jul 21 07:13:50 2017] md: data-check of RAID array md0
[Fri Jul 21 07:13:50 2017] md: minimum _guaranteed_  speed: 15000 KB/sec/disk.
[Fri Jul 21 07:13:50 2017] md: using maximum available idle IO bandwidth (but not more than 1000000 KB/sec) for data-check.
[Fri Jul 21 07:13:50 2017] md: using 128k window, over a total of 405798912k.
[Fri Jul 21 07:59:38 2017] md: md0: data-check done.
[Fri Jul 21 08:20:50 2017] md: data-check of RAID array md0
[Fri Jul 21 08:20:50 2017] md: minimum _guaranteed_  speed: 15000 KB/sec/disk.
[Fri Jul 21 08:20:50 2017] md: using maximum available idle IO bandwidth (but not more than 1000000 KB/sec) for data-check.
[Fri Jul 21 08:20:50 2017] md: using 128k window, over a total of 405798912k.
[Fri Jul 21 09:04:12 2017] md: md0: data-check done.
[Fri Jul 21 09:27:50 2017] md: data-check of RAID array md0
[Fri Jul 21 09:27:50 2017] md: minimum _guaranteed_  speed: 15000 KB/sec/disk.
[Fri Jul 21 09:27:50 2017] md: using maximum available idle IO bandwidth (but not more than 1000000 KB/sec) for data-check.
[Fri Jul 21 09:27:50 2017] md: using 128k window, over a total of 405798912k.
[Fri Jul 21 10:09:28 2017] md: md0: data-check done.
[Fri Jul 21 10:34:50 2017] md: data-check of RAID array md0
[Fri Jul 21 10:34:50 2017] md: minimum _guaranteed_  speed: 15000 KB/sec/disk.
[Fri Jul 21 10:34:50 2017] md: using maximum available idle IO bandwidth (but not more than 1000000 KB/sec) for data-check.
[Fri Jul 21 10:34:50 2017] md: using 128k window, over a total of 405798912k.
[Fri Jul 21 10:46:22 2017] ABORT_TASK: Found referenced qla2xxx task_tag: 1247284
[Fri Jul 21 10:46:24 2017] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 1247284
[Fri Jul 21 10:46:24 2017] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1247236
[Fri Jul 21 10:46:34 2017] ABORT_TASK: Found referenced qla2xxx task_tag: 1248148
[Fri Jul 21 10:49:04 2017] ABORT_TASK: Found referenced qla2xxx task_tag: 1175812
[Fri Jul 21 10:50:22 2017] INFO: task kworker/u34:2:4840 blocked for more than 180 seconds.
[Fri Jul 21 10:50:22 2017]       Not tainted 4.9.38-ul5 #3
[Fri Jul 21 10:50:22 2017] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Fri Jul 21 10:50:22 2017] kworker/u34:2   D    0  4840      2 0x00000000
[Fri Jul 21 10:50:22 2017] Workqueue: tmr-iblock target_tmr_work [target_core_mod]
[Fri Jul 21 10:50:22 2017]  0000000000000000 ffff9eb4d5331880 ffff9ebccaccd240 ffff9ebcffcd5080
[Fri Jul 21 10:50:22 2017]  ffff9ebcdc5ca340 ffffb11c07d0fbe8 ffffffffa8092196 0000000000130b94
[Fri Jul 21 10:50:22 2017]  ffff9ebccaccd240 7fffffffffffffff 0000000000000000 ffff9ebccaccd240
[Fri Jul 21 10:50:22 2017] Call Trace:
[Fri Jul 21 10:50:22 2017]  [<ffffffffa8092196>] ? __schedule+0x166/0x530
[Fri Jul 21 10:50:22 2017]  [<ffffffffa8092591>] schedule+0x31/0x80
[Fri Jul 21 10:50:22 2017]  [<ffffffffa8094ed7>] schedule_timeout+0x147/0x190
[Fri Jul 21 10:50:22 2017]  [<ffffffffa7cb5edd>] ? up+0x2d/0x50
[Fri Jul 21 10:50:22 2017]  [<ffffffffa7ce7b85>] ? irq_work_queue+0x85/0x90
[Fri Jul 21 10:50:22 2017]  [<ffffffffa8092f10>] wait_for_common+0xb0/0x170
[Fri Jul 21 10:50:22 2017]  [<ffffffffa7c9fdb0>] ? wake_up_q+0x70/0x70
[Fri Jul 21 10:50:22 2017]  [<ffffffffa8092fe8>] wait_for_completion+0x18/0x20
[Fri Jul 21 10:50:22 2017]  [<ffffffffc02b0d9c>] __transport_wait_for_tasks+0xac/0x110 [target_core_mod]
[Fri Jul 21 10:50:22 2017]  [<ffffffffc02b0ece>] transport_wait_for_tasks+0x4e/0x80 [target_core_mod]
[Fri Jul 21 10:50:22 2017]  [<ffffffffc02aea79>] core_tmr_abort_task+0x119/0x190 [target_core_mod]
[Fri Jul 21 10:50:22 2017]  [<ffffffffc02b1011>] target_tmr_work+0x111/0x120 [target_core_mod]
[Fri Jul 21 10:50:22 2017]  [<ffffffffa7c93111>] process_one_work+0x121/0x3c0
[Fri Jul 21 10:50:22 2017]  [<ffffffffa7c933f3>] worker_thread+0x43/0x4d0
[Fri Jul 21 10:50:22 2017]  [<ffffffffa7c933b0>] ? process_one_work+0x3c0/0x3c0
[Fri Jul 21 10:50:22 2017]  [<ffffffffa7c933b0>] ? process_one_work+0x3c0/0x3c0
[Fri Jul 21 10:50:22 2017]  [<ffffffffa7c98be4>] kthread+0xd4/0xf0
[Fri Jul 21 10:50:22 2017]  [<ffffffffa7c98b10>] ? kthread_worker_fn+0x100/0x100
[Fri Jul 21 10:50:22 2017]  [<ffffffffa8096022>] ret_from_fork+0x22/0x30
[Fri Jul 21 10:53:23 2017] INFO: task kworker/13:1:3063 blocked for more than 180 seconds.
[Fri Jul 21 10:53:23 2017]       Not tainted 4.9.38-ul5 #3
[Fri Jul 21 10:53:23 2017] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Fri Jul 21 10:53:23 2017] kworker/13:1    D    0  3063      2 0x00000000
[Fri Jul 21 10:53:23 2017] Workqueue: events qlt_free_session_done [qla2xxx]
[Fri Jul 21 10:53:23 2017]  0000000000000000 ffff9ebcd4c38000 ffff9ebccd9c2340 ffff9ebcffd55080
[Fri Jul 21 10:53:23 2017]  ffff9ebcdc5cde00 ffffb11c2736fb28 ffffffffa8092196 ffffffffa7ca4a37
[Fri Jul 21 10:53:23 2017]  ffff9ebccd9c2340 7fffffffffffffff 0000000000000000 ffff9ebccd9c2340
[Fri Jul 21 10:53:23 2017] Call Trace:
[Fri Jul 21 10:53:23 2017]  [<ffffffffa8092196>] ? __schedule+0x166/0x530
[Fri Jul 21 10:53:23 2017]  [<ffffffffa7ca4a37>] ? __enqueue_entity+0x67/0x70
[Fri Jul 21 10:53:23 2017]  [<ffffffffa8092591>] schedule+0x31/0x80
[Fri Jul 21 10:53:23 2017]  [<ffffffffa8094ed7>] schedule_timeout+0x147/0x190
[Fri Jul 21 10:53:23 2017]  [<ffffffffa7ca3faf>] ? sched_clock_cpu+0x8f/0xa0
[Fri Jul 21 10:53:23 2017]  [<ffffffffa7c9f06f>] ? check_preempt_curr+0x4f/0x80
[Fri Jul 21 10:53:23 2017]  [<ffffffffa8092f10>] wait_for_common+0xb0/0x170
[Fri Jul 21 10:53:23 2017]  [<ffffffffa7c9fdb0>] ? wake_up_q+0x70/0x70
[Fri Jul 21 10:53:23 2017]  [<ffffffffa8092fe8>] wait_for_completion+0x18/0x20
[Fri Jul 21 10:53:23 2017]  [<ffffffffc02b3901>] target_wait_for_sess_cmds+0x41/0x110 [target_core_mod]
[Fri Jul 21 10:53:23 2017]  [<ffffffffc033afb0>] tcm_qla2xxx_free_session+0x40/0x80 [tcm_qla2xxx]
[Fri Jul 21 10:53:23 2017]  [<ffffffffc048d44d>] qlt_free_session_done+0x2cd/0x3d0 [qla2xxx]
[Fri Jul 21 10:53:23 2017]  [<ffffffffa7c92323>] ? insert_work+0x53/0xc0
[Fri Jul 21 10:53:23 2017]  [<ffffffffa7c93111>] process_one_work+0x121/0x3c0
[Fri Jul 21 10:53:23 2017]  [<ffffffffa7c933f3>] worker_thread+0x43/0x4d0
[Fri Jul 21 10:53:23 2017]  [<ffffffffa7c933b0>] ? process_one_work+0x3c0/0x3c0
[Fri Jul 21 10:53:23 2017]  [<ffffffffa7c933b0>] ? process_one_work+0x3c0/0x3c0
[Fri Jul 21 10:53:23 2017]  [<ffffffffa7c98be4>] kthread+0xd4/0xf0
[Fri Jul 21 10:53:23 2017]  [<ffffffffa7c98b10>] ? kthread_worker_fn+0x100/0x100
[Fri Jul 21 10:53:23 2017]  [<ffffffffa8096022>] ret_from_fork+0x22/0x30
[Fri Jul 21 10:53:23 2017] INFO: task kworker/2:1:20395 blocked for more than 180 seconds.
[Fri Jul 21 10:53:23 2017]       Not tainted 4.9.38-ul5 #3
[Fri Jul 21 10:53:23 2017] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Fri Jul 21 10:53:23 2017] kworker/2:1     D    0 20395      2 0x00000000
[Fri Jul 21 10:53:23 2017] Workqueue: events qlt_free_session_done [qla2xxx]
[Fri Jul 21 10:53:23 2017]  0000000000000000 ffff9ebcd4c38000 ffff9eb468940bc0 ffff9eb4dfc95080
[Fri Jul 21 10:53:23 2017]  ffff9eb4dc689780 ffffb11c077e7b28 ffffffffa8092196 ffffffffa7ca4a37
[Fri Jul 21 10:53:23 2017]  ffff9eb468940bc0 7fffffffffffffff 0000000000000000 ffff9eb468940bc0
[Fri Jul 21 10:53:23 2017] Call Trace:
[Fri Jul 21 10:53:23 2017]  [<ffffffffa8092196>] ? __schedule+0x166/0x530
[Fri Jul 21 10:53:23 2017]  [<ffffffffa7ca4a37>] ? __enqueue_entity+0x67/0x70
[Fri Jul 21 10:53:23 2017]  [<ffffffffa8092591>] schedule+0x31/0x80
[Fri Jul 21 10:53:23 2017]  [<ffffffffa8094ed7>] schedule_timeout+0x147/0x190
[Fri Jul 21 10:53:23 2017]  [<ffffffffa7ca3faf>] ? sched_clock_cpu+0x8f/0xa0
[Fri Jul 21 10:53:23 2017]  [<ffffffffa7c9f06f>] ? check_preempt_curr+0x4f/0x80
[Fri Jul 21 10:53:23 2017]  [<ffffffffa8092f10>] wait_for_common+0xb0/0x170
[Fri Jul 21 10:53:23 2017]  [<ffffffffa7c9fdb0>] ? wake_up_q+0x70/0x70
[Fri Jul 21 10:53:23 2017]  [<ffffffffa8092fe8>] wait_for_completion+0x18/0x20
[Fri Jul 21 10:53:23 2017]  [<ffffffffc02b3901>] target_wait_for_sess_cmds+0x41/0x110 [target_core_mod]
[Fri Jul 21 10:53:23 2017]  [<ffffffffc033afb0>] tcm_qla2xxx_free_session+0x40/0x80 [tcm_qla2xxx]
[Fri Jul 21 10:53:23 2017]  [<ffffffffc048d44d>] qlt_free_session_done+0x2cd/0x3d0 [qla2xxx]
[Fri Jul 21 10:53:23 2017]  [<ffffffffa7c92323>] ? insert_work+0x53/0xc0
[Fri Jul 21 10:53:23 2017]  [<ffffffffa7c93111>] process_one_work+0x121/0x3c0
[Fri Jul 21 10:53:23 2017]  [<ffffffffa7c933f3>] worker_thread+0x43/0x4d0
[Fri Jul 21 10:53:23 2017]  [<ffffffffa7c933b0>] ? process_one_work+0x3c0/0x3c0
[Fri Jul 21 10:53:23 2017]  [<ffffffffa7c933b0>] ? process_one_work+0x3c0/0x3c0
[Fri Jul 21 10:53:23 2017]  [<ffffffffa7c98be4>] kthread+0xd4/0xf0
[Fri Jul 21 10:53:23 2017]  [<ffffffffa7c98b10>] ? kthread_worker_fn+0x100/0x100
[Fri Jul 21 10:53:23 2017]  [<ffffffffa8096022>] ret_from_fork+0x22/0x30
[Fri Jul 21 10:53:23 2017] INFO: task kworker/6:2:788 blocked for more than 180 seconds.
[Fri Jul 21 10:53:23 2017]       Not tainted 4.9.38-ul5 #3
[Fri Jul 21 10:53:23 2017] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Fri Jul 21 10:53:23 2017] kworker/6:2     D    0   788      2 0x00000000
[Fri Jul 21 10:53:23 2017] Workqueue: events qlt_free_session_done [qla2xxx]
[Fri Jul 21 10:53:23 2017]  0000000000000000 ffff9ebcd4c38000 ffff9ebcc9ce8000 ffff9ebcffc95080
[Fri Jul 21 10:53:23 2017]  ffff9ebcdc5c9780 ffffb11c24e67b28 ffffffffa8092196 ffffffffa7ca4a37
[Fri Jul 21 10:53:23 2017]  ffff9ebcc9ce8000 7fffffffffffffff 0000000000000000 ffff9ebcc9ce8000
[Fri Jul 21 10:53:23 2017] Call Trace:
[Fri Jul 21 10:53:23 2017]  [<ffffffffa8092196>] ? __schedule+0x166/0x530
[Fri Jul 21 10:53:23 2017]  [<ffffffffa7ca4a37>] ? __enqueue_entity+0x67/0x70
[Fri Jul 21 10:53:23 2017]  [<ffffffffa8092591>] schedule+0x31/0x80
[Fri Jul 21 10:53:23 2017]  [<ffffffffa8094ed7>] schedule_timeout+0x147/0x190
[Fri Jul 21 10:53:23 2017]  [<ffffffffa7ca3faf>] ? sched_clock_cpu+0x8f/0xa0
[Fri Jul 21 10:53:23 2017]  [<ffffffffa7c9f06f>] ? check_preempt_curr+0x4f/0x80
[Fri Jul 21 10:53:23 2017]  [<ffffffffa8092f10>] wait_for_common+0xb0/0x170
[Fri Jul 21 10:53:23 2017]  [<ffffffffa7c9fdb0>] ? wake_up_q+0x70/0x70
[Fri Jul 21 10:53:23 2017]  [<ffffffffa8092fe8>] wait_for_completion+0x18/0x20
[Fri Jul 21 10:53:23 2017]  [<ffffffffc02b3901>] target_wait_for_sess_cmds+0x41/0x110 [target_core_mod]
[Fri Jul 21 10:53:23 2017]  [<ffffffffc033afb0>] tcm_qla2xxx_free_session+0x40/0x80 [tcm_qla2xxx]
[Fri Jul 21 10:53:23 2017]  [<ffffffffc048d44d>] qlt_free_session_done+0x2cd/0x3d0 [qla2xxx]
[Fri Jul 21 10:53:23 2017]  [<ffffffffa7c92323>] ? insert_work+0x53/0xc0
[Fri Jul 21 10:53:23 2017]  [<ffffffffa7c93111>] process_one_work+0x121/0x3c0
[Fri Jul 21 10:53:23 2017]  [<ffffffffa7c933f3>] worker_thread+0x43/0x4d0
[Fri Jul 21 10:53:23 2017]  [<ffffffffa7c933b0>] ? process_one_work+0x3c0/0x3c0
[Fri Jul 21 10:53:23 2017]  [<ffffffffa7c933b0>] ? process_one_work+0x3c0/0x3c0
[Fri Jul 21 10:53:23 2017]  [<ffffffffa7c98be4>] kthread+0xd4/0xf0
[Fri Jul 21 10:53:23 2017]  [<ffffffffa7c98b10>] ? kthread_worker_fn+0x100/0x100
[Fri Jul 21 10:53:23 2017]  [<ffffffffa8096022>] ret_from_fork+0x22/0x30
[Fri Jul 21 10:53:23 2017] INFO: task kworker/u33:0:3806 blocked for more than 180 seconds.
[Fri Jul 21 10:53:23 2017]       Not tainted 4.9.38-ul5 #3
[Fri Jul 21 10:53:23 2017] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Fri Jul 21 10:53:23 2017] kworker/u33:0   D    0  3806      2 0x00000000
[Fri Jul 21 10:53:23 2017] Workqueue: tmr-iblock target_tmr_work [target_core_mod]
[Fri Jul 21 10:53:23 2017]  0000000000000000 ffff9eb4d5331880 ffff9eb468909780 ffff9eb4dfd15080
[Fri Jul 21 10:53:23 2017]  ffff9eb4dc68e9c0 ffffb11c26d47be8 ffffffffa8092196 000000000011f104
[Fri Jul 21 10:53:23 2017]  ffff9eb468909780 7fffffffffffffff 0000000000000000 ffff9eb468909780
[Fri Jul 21 10:53:23 2017] Call Trace:
[Fri Jul 21 10:53:23 2017]  [<ffffffffa8092196>] ? __schedule+0x166/0x530
[Fri Jul 21 10:53:23 2017]  [<ffffffffa8092591>] schedule+0x31/0x80
[Fri Jul 21 10:53:23 2017]  [<ffffffffa8094ed7>] schedule_timeout+0x147/0x190
[Fri Jul 21 10:53:23 2017]  [<ffffffffa7cb5edd>] ? up+0x2d/0x50
[Fri Jul 21 10:53:23 2017]  [<ffffffffa7ce7b85>] ? irq_work_queue+0x85/0x90
[Fri Jul 21 10:53:23 2017]  [<ffffffffa8092f10>] wait_for_common+0xb0/0x170
[Fri Jul 21 10:53:23 2017]  [<ffffffffa7c9fdb0>] ? wake_up_q+0x70/0x70
[Fri Jul 21 10:53:23 2017]  [<ffffffffa8092fe8>] wait_for_completion+0x18/0x20
[Fri Jul 21 10:53:23 2017]  [<ffffffffc02b0d9c>] __transport_wait_for_tasks+0xac/0x110 [target_core_mod]
[Fri Jul 21 10:53:23 2017]  [<ffffffffc02b0ece>] transport_wait_for_tasks+0x4e/0x80 [target_core_mod]
[Fri Jul 21 10:53:23 2017]  [<ffffffffc02aea79>] core_tmr_abort_task+0x119/0x190 [target_core_mod]
[Fri Jul 21 10:53:23 2017]  [<ffffffffc02b1011>] target_tmr_work+0x111/0x120 [target_core_mod]
[Fri Jul 21 10:53:23 2017]  [<ffffffffa7c93111>] process_one_work+0x121/0x3c0
[Fri Jul 21 10:53:23 2017]  [<ffffffffa7c933f3>] worker_thread+0x43/0x4d0
[Fri Jul 21 10:53:23 2017]  [<ffffffffa7c933b0>] ? process_one_work+0x3c0/0x3c0
[Fri Jul 21 10:53:23 2017]  [<ffffffffa7c933b0>] ? process_one_work+0x3c0/0x3c0
[Fri Jul 21 10:53:23 2017]  [<ffffffffa7c98be4>] kthread+0xd4/0xf0
[Fri Jul 21 10:53:23 2017]  [<ffffffffa7c98b10>] ? kthread_worker_fn+0x100/0x100
[Fri Jul 21 10:53:23 2017]  [<ffffffffa8096022>] ret_from_fork+0x22/0x30
[Fri Jul 21 10:53:23 2017] INFO: task kworker/u34:2:4840 blocked for more than 180 seconds.
[Fri Jul 21 10:53:23 2017]       Not tainted 4.9.38-ul5 #3
[Fri Jul 21 10:53:23 2017] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Fri Jul 21 10:53:23 2017] kworker/u34:2   D    0  4840      2 0x00000000
[Fri Jul 21 10:53:23 2017] Workqueue: tmr-iblock target_tmr_work [target_core_mod]
[Fri Jul 21 10:53:23 2017]  0000000000000000 ffff9eb4d5331880 ffff9ebccaccd240 ffff9ebcffcd5080
[Fri Jul 21 10:53:23 2017]  ffff9ebcdc5ca340 ffffb11c07d0fbe8 ffffffffa8092196 0000000000130b94
[Fri Jul 21 10:53:23 2017]  ffff9ebccaccd240 7fffffffffffffff 0000000000000000 ffff9ebccaccd240
[Fri Jul 21 10:53:23 2017] Call Trace:
[Fri Jul 21 10:53:23 2017]  [<ffffffffa8092196>] ? __schedule+0x166/0x530
[Fri Jul 21 10:53:23 2017]  [<ffffffffa8092591>] schedule+0x31/0x80
[Fri Jul 21 10:53:23 2017]  [<ffffffffa8094ed7>] schedule_timeout+0x147/0x190
[Fri Jul 21 10:53:23 2017]  [<ffffffffa7cb5edd>] ? up+0x2d/0x50
[Fri Jul 21 10:53:23 2017]  [<ffffffffa7ce7b85>] ? irq_work_queue+0x85/0x90
[Fri Jul 21 10:53:23 2017]  [<ffffffffa8092f10>] wait_for_common+0xb0/0x170
[Fri Jul 21 10:53:23 2017]  [<ffffffffa7c9fdb0>] ? wake_up_q+0x70/0x70
[Fri Jul 21 10:53:23 2017]  [<ffffffffa8092fe8>] wait_for_completion+0x18/0x20
[Fri Jul 21 10:53:23 2017]  [<ffffffffc02b0d9c>] __transport_wait_for_tasks+0xac/0x110 [target_core_mod]
[Fri Jul 21 10:53:23 2017]  [<ffffffffc02b0ece>] transport_wait_for_tasks+0x4e/0x80 [target_core_mod]
[Fri Jul 21 10:53:23 2017]  [<ffffffffc02aea79>] core_tmr_abort_task+0x119/0x190 [target_core_mod]
[Fri Jul 21 10:53:23 2017]  [<ffffffffc02b1011>] target_tmr_work+0x111/0x120 [target_core_mod]
[Fri Jul 21 10:53:23 2017]  [<ffffffffa7c93111>] process_one_work+0x121/0x3c0
[Fri Jul 21 10:53:23 2017]  [<ffffffffa7c933f3>] worker_thread+0x43/0x4d0
[Fri Jul 21 10:53:23 2017]  [<ffffffffa7c933b0>] ? process_one_work+0x3c0/0x3c0
[Fri Jul 21 10:53:23 2017]  [<ffffffffa7c933b0>] ? process_one_work+0x3c0/0x3c0
[Fri Jul 21 10:53:23 2017]  [<ffffffffa7c98be4>] kthread+0xd4/0xf0
[Fri Jul 21 10:53:23 2017]  [<ffffffffa7c98b10>] ? kthread_worker_fn+0x100/0x100
[Fri Jul 21 10:53:23 2017]  [<ffffffffa8096022>] ret_from_fork+0x22/0x30
[Fri Jul 21 10:56:23 2017] INFO: task kworker/13:1:3063 blocked for more than 180 seconds.
[Fri Jul 21 10:56:23 2017]       Not tainted 4.9.38-ul5 #3
[Fri Jul 21 10:56:23 2017] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Fri Jul 21 10:56:23 2017] kworker/13:1    D    0  3063      2 0x00000000
[Fri Jul 21 10:56:23 2017] Workqueue: events qlt_free_session_done [qla2xxx]
[Fri Jul 21 10:56:23 2017]  0000000000000000 ffff9ebcd4c38000 ffff9ebccd9c2340 ffff9ebcffd55080
[Fri Jul 21 10:56:23 2017]  ffff9ebcdc5cde00 ffffb11c2736fb28 ffffffffa8092196 ffffffffa7ca4a37
[Fri Jul 21 10:56:23 2017]  ffff9ebccd9c2340 7fffffffffffffff 0000000000000000 ffff9ebccd9c2340
[Fri Jul 21 10:56:23 2017] Call Trace:
[Fri Jul 21 10:56:23 2017]  [<ffffffffa8092196>] ? __schedule+0x166/0x530
[Fri Jul 21 10:56:23 2017]  [<ffffffffa7ca4a37>] ? __enqueue_entity+0x67/0x70
[Fri Jul 21 10:56:23 2017]  [<ffffffffa8092591>] schedule+0x31/0x80
[Fri Jul 21 10:56:23 2017]  [<ffffffffa8094ed7>] schedule_timeout+0x147/0x190
[Fri Jul 21 10:56:23 2017]  [<ffffffffa7ca3faf>] ? sched_clock_cpu+0x8f/0xa0
[Fri Jul 21 10:56:23 2017]  [<ffffffffa7c9f06f>] ? check_preempt_curr+0x4f/0x80
[Fri Jul 21 10:56:23 2017]  [<ffffffffa8092f10>] wait_for_common+0xb0/0x170
[Fri Jul 21 10:56:23 2017]  [<ffffffffa7c9fdb0>] ? wake_up_q+0x70/0x70
[Fri Jul 21 10:56:23 2017]  [<ffffffffa8092fe8>] wait_for_completion+0x18/0x20
[Fri Jul 21 10:56:23 2017]  [<ffffffffc02b3901>] target_wait_for_sess_cmds+0x41/0x110 [target_core_mod]
[Fri Jul 21 10:56:23 2017]  [<ffffffffc033afb0>] tcm_qla2xxx_free_session+0x40/0x80 [tcm_qla2xxx]
[Fri Jul 21 10:56:23 2017]  [<ffffffffc048d44d>] qlt_free_session_done+0x2cd/0x3d0 [qla2xxx]
[Fri Jul 21 10:56:23 2017]  [<ffffffffa7c92323>] ? insert_work+0x53/0xc0
[Fri Jul 21 10:56:23 2017]  [<ffffffffa7c93111>] process_one_work+0x121/0x3c0
[Fri Jul 21 10:56:23 2017]  [<ffffffffa7c933f3>] worker_thread+0x43/0x4d0
[Fri Jul 21 10:56:23 2017]  [<ffffffffa7c933b0>] ? process_one_work+0x3c0/0x3c0
[Fri Jul 21 10:56:23 2017]  [<ffffffffa7c933b0>] ? process_one_work+0x3c0/0x3c0
[Fri Jul 21 10:56:23 2017]  [<ffffffffa7c98be4>] kthread+0xd4/0xf0
[Fri Jul 21 10:56:23 2017]  [<ffffffffa7c98b10>] ? kthread_worker_fn+0x100/0x100
[Fri Jul 21 10:56:23 2017]  [<ffffffffa8096022>] ret_from_fork+0x22/0x30
[Fri Jul 21 10:56:23 2017] INFO: task kworker/2:1:20395 blocked for more than 180 seconds.
[Fri Jul 21 10:56:23 2017]       Not tainted 4.9.38-ul5 #3
[Fri Jul 21 10:56:23 2017] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Fri Jul 21 10:56:23 2017] kworker/2:1     D    0 20395      2 0x00000000
[Fri Jul 21 10:56:23 2017] Workqueue: events qlt_free_session_done [qla2xxx]
[Fri Jul 21 10:56:23 2017]  0000000000000000 ffff9ebcd4c38000 ffff9eb468940bc0 ffff9eb4dfc95080
[Fri Jul 21 10:56:23 2017]  ffff9eb4dc689780 ffffb11c077e7b28 ffffffffa8092196 ffffffffa7ca4a37
[Fri Jul 21 10:56:23 2017]  ffff9eb468940bc0 7fffffffffffffff 0000000000000000 ffff9eb468940bc0
[Fri Jul 21 10:56:23 2017] Call Trace:
[Fri Jul 21 10:56:23 2017]  [<ffffffffa8092196>] ? __schedule+0x166/0x530
[Fri Jul 21 10:56:23 2017]  [<ffffffffa7ca4a37>] ? __enqueue_entity+0x67/0x70
[Fri Jul 21 10:56:23 2017]  [<ffffffffa8092591>] schedule+0x31/0x80
[Fri Jul 21 10:56:23 2017]  [<ffffffffa8094ed7>] schedule_timeout+0x147/0x190
[Fri Jul 21 10:56:23 2017]  [<ffffffffa7ca3faf>] ? sched_clock_cpu+0x8f/0xa0
[Fri Jul 21 10:56:23 2017]  [<ffffffffa7c9f06f>] ? check_preempt_curr+0x4f/0x80
[Fri Jul 21 10:56:23 2017]  [<ffffffffa8092f10>] wait_for_common+0xb0/0x170
[Fri Jul 21 10:56:23 2017]  [<ffffffffa7c9fdb0>] ? wake_up_q+0x70/0x70
[Fri Jul 21 10:56:23 2017]  [<ffffffffa8092fe8>] wait_for_completion+0x18/0x20
[Fri Jul 21 10:56:23 2017]  [<ffffffffc02b3901>] target_wait_for_sess_cmds+0x41/0x110 [target_core_mod]
[Fri Jul 21 10:56:23 2017]  [<ffffffffc033afb0>] tcm_qla2xxx_free_session+0x40/0x80 [tcm_qla2xxx]
[Fri Jul 21 10:56:23 2017]  [<ffffffffc048d44d>] qlt_free_session_done+0x2cd/0x3d0 [qla2xxx]
[Fri Jul 21 10:56:23 2017]  [<ffffffffa7c92323>] ? insert_work+0x53/0xc0
[Fri Jul 21 10:56:23 2017]  [<ffffffffa7c93111>] process_one_work+0x121/0x3c0
[Fri Jul 21 10:56:23 2017]  [<ffffffffa7c933f3>] worker_thread+0x43/0x4d0
[Fri Jul 21 10:56:23 2017]  [<ffffffffa7c933b0>] ? process_one_work+0x3c0/0x3c0
[Fri Jul 21 10:56:23 2017]  [<ffffffffa7c933b0>] ? process_one_work+0x3c0/0x3c0
[Fri Jul 21 10:56:23 2017]  [<ffffffffa7c98be4>] kthread+0xd4/0xf0
[Fri Jul 21 10:56:23 2017]  [<ffffffffa7c98b10>] ? kthread_worker_fn+0x100/0x100
[Fri Jul 21 10:56:23 2017]  [<ffffffffa8096022>] ret_from_fork+0x22/0x30
[Fri Jul 21 10:56:23 2017] INFO: task kworker/10:2:28233 blocked for more than 180 seconds.
[Fri Jul 21 10:56:23 2017]       Not tainted 4.9.38-ul5 #3
[Fri Jul 21 10:56:23 2017] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Fri Jul 21 10:56:23 2017] kworker/10:2    D    0 28233      2 0x00000000
[Fri Jul 21 10:56:23 2017] Workqueue: events qlt_free_session_done [qla2xxx]
[Fri Jul 21 10:56:23 2017]  0000000000000000 ffff9ebcd4c38000 ffff9eb467848000 ffff9eb4dfd95080
[Fri Jul 21 10:56:23 2017]  ffff9eb4dc68d240 ffffb11c0ffcbb28 ffffffffa8092196 ffffffffa7ca4a37
[Fri Jul 21 10:56:23 2017]  ffff9eb467848000 7fffffffffffffff 0000000000000000 ffff9eb467848000
[Fri Jul 21 10:56:23 2017] Call Trace:
[Fri Jul 21 10:56:23 2017]  [<ffffffffa8092196>] ? __schedule+0x166/0x530
[Fri Jul 21 10:56:23 2017]  [<ffffffffa7ca4a37>] ? __enqueue_entity+0x67/0x70
[Fri Jul 21 10:56:23 2017]  [<ffffffffa8092591>] schedule+0x31/0x80
[Fri Jul 21 10:56:23 2017]  [<ffffffffa8094ed7>] schedule_timeout+0x147/0x190
[Fri Jul 21 10:56:23 2017]  [<ffffffffa7ca3faf>] ? sched_clock_cpu+0x8f/0xa0
[Fri Jul 21 10:56:23 2017]  [<ffffffffa7c9f06f>] ? check_preempt_curr+0x4f/0x80
[Fri Jul 21 10:56:23 2017]  [<ffffffffa8092f10>] wait_for_common+0xb0/0x170
[Fri Jul 21 10:56:23 2017]  [<ffffffffa7c9fdb0>] ? wake_up_q+0x70/0x70
[Fri Jul 21 10:56:23 2017]  [<ffffffffa8092fe8>] wait_for_completion+0x18/0x20
[Fri Jul 21 10:56:23 2017]  [<ffffffffc02b3901>] target_wait_for_sess_cmds+0x41/0x110 [target_core_mod]
[Fri Jul 21 10:56:23 2017]  [<ffffffffc033afb0>] tcm_qla2xxx_free_session+0x40/0x80 [tcm_qla2xxx]
[Fri Jul 21 10:56:23 2017]  [<ffffffffc048d44d>] qlt_free_session_done+0x2cd/0x3d0 [qla2xxx]
[Fri Jul 21 10:56:23 2017]  [<ffffffffa7c92323>] ? insert_work+0x53/0xc0
[Fri Jul 21 10:56:23 2017]  [<ffffffffa7c93111>] process_one_work+0x121/0x3c0
[Fri Jul 21 10:56:23 2017]  [<ffffffffa7c933f3>] worker_thread+0x43/0x4d0
[Fri Jul 21 10:56:23 2017]  [<ffffffffa7c933b0>] ? process_one_work+0x3c0/0x3c0
[Fri Jul 21 10:56:23 2017]  [<ffffffffa7c933b0>] ? process_one_work+0x3c0/0x3c0
[Fri Jul 21 10:56:23 2017]  [<ffffffffa7c98be4>] kthread+0xd4/0xf0
[Fri Jul 21 10:56:23 2017]  [<ffffffffa7c98b10>] ? kthread_worker_fn+0x100/0x100
[Fri Jul 21 10:56:23 2017]  [<ffffffffa8096022>] ret_from_fork+0x22/0x30
[Fri Jul 21 10:56:23 2017] INFO: task kworker/6:2:788 blocked for more than 180 seconds.
[Fri Jul 21 10:56:23 2017]       Not tainted 4.9.38-ul5 #3
[Fri Jul 21 10:56:23 2017] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Fri Jul 21 10:56:23 2017] kworker/6:2     D    0   788      2 0x00000000
[Fri Jul 21 10:56:23 2017] Workqueue: events qlt_free_session_done [qla2xxx]
[Fri Jul 21 10:56:23 2017]  0000000000000000 ffff9ebcd4c38000 ffff9ebcc9ce8000 ffff9ebcffc95080
[Fri Jul 21 10:56:23 2017]  ffff9ebcdc5c9780 ffffb11c24e67b28 ffffffffa8092196 ffffffffa7ca4a37
[Fri Jul 21 10:56:23 2017]  ffff9ebcc9ce8000 7fffffffffffffff 0000000000000000 ffff9ebcc9ce8000
[Fri Jul 21 10:56:23 2017] Call Trace:
[Fri Jul 21 10:56:23 2017]  [<ffffffffa8092196>] ? __schedule+0x166/0x530
[Fri Jul 21 10:56:23 2017]  [<ffffffffa7ca4a37>] ? __enqueue_entity+0x67/0x70
[Fri Jul 21 10:56:23 2017]  [<ffffffffa8092591>] schedule+0x31/0x80
[Fri Jul 21 10:56:23 2017]  [<ffffffffa8094ed7>] schedule_timeout+0x147/0x190
[Fri Jul 21 10:56:23 2017]  [<ffffffffa7ca3faf>] ? sched_clock_cpu+0x8f/0xa0
[Fri Jul 21 10:56:23 2017]  [<ffffffffa7c9f06f>] ? check_preempt_curr+0x4f/0x80
[Fri Jul 21 10:56:23 2017]  [<ffffffffa8092f10>] wait_for_common+0xb0/0x170
[Fri Jul 21 10:56:23 2017]  [<ffffffffa7c9fdb0>] ? wake_up_q+0x70/0x70
[Fri Jul 21 10:56:23 2017]  [<ffffffffa8092fe8>] wait_for_completion+0x18/0x20
[Fri Jul 21 10:56:23 2017]  [<ffffffffc02b3901>] target_wait_for_sess_cmds+0x41/0x110 [target_core_mod]
[Fri Jul 21 10:56:23 2017]  [<ffffffffc033afb0>] tcm_qla2xxx_free_session+0x40/0x80 [tcm_qla2xxx]
[Fri Jul 21 10:56:23 2017]  [<ffffffffc048d44d>] qlt_free_session_done+0x2cd/0x3d0 [qla2xxx]
[Fri Jul 21 10:56:23 2017]  [<ffffffffa7c92323>] ? insert_work+0x53/0xc0
[Fri Jul 21 10:56:23 2017]  [<ffffffffa7c93111>] process_one_work+0x121/0x3c0
[Fri Jul 21 10:56:23 2017]  [<ffffffffa7c933f3>] worker_thread+0x43/0x4d0
[Fri Jul 21 10:56:23 2017]  [<ffffffffa7c933b0>] ? process_one_work+0x3c0/0x3c0
[Fri Jul 21 10:56:23 2017]  [<ffffffffa7c933b0>] ? process_one_work+0x3c0/0x3c0
[Fri Jul 21 10:56:23 2017]  [<ffffffffa7c98be4>] kthread+0xd4/0xf0
[Fri Jul 21 10:56:23 2017]  [<ffffffffa7c98b10>] ? kthread_worker_fn+0x100/0x100
[Fri Jul 21 10:56:23 2017]  [<ffffffffa8096022>] ret_from_fork+0x22/0x30

I hope this helps, let me know if you need anything from our end.
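
Should it happen again, we'll also grab the stacks of the blocked kworkers
directly, along these lines (a sketch, untested):

  # dump the kernel stack of every kworker stuck in D state
  for pid in $(ps -eo pid,stat,comm | awk '$2 ~ /^D/ && $3 ~ /^kworker/ {print $1}'); do
      echo "=== kworker $pid ==="
      cat /proc/$pid/stack
  done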

Regards,
Pascal de Bruijn


Pascal de Bruijn | Unilogic B.V. July 24, 2017, 2:50 p.m. UTC | #3
On 21-07-17 15:07, Pascal de Bruijn wrote:
> On 11-07-17 19:52, Nicholas A. Bellinger wrote:
>> Hi Pascal,
>>
>> On Mon, 2017-07-10 at 20:52 -0700, Nicholas A. Bellinger wrote:
>>> On Mon, 2017-07-10 at 20:41 -0700, Nicholas A. Bellinger wrote:
>>>> On Mon, 2017-07-10 at 17:04 +0200, Pascal de Bruijn wrote:
>>>>> On 07-07-17 06:26, Nicholas A. Bellinger wrote:
>>>>>> On Mon, 2017-07-03 at 16:03 +0200, Pascal de Bruijn wrote:
>>>>>>> So abort_task can still be observed, but they do not result in a
>>>>>>> non-functional not-quite-PANICked machine anymore.
>>>>>> Thank alot for the bug report and your continuous testing to get 
>>>>>> this
>>>>>> resolved.  The patch is queued up in target-pending/for-next with 
>>>>>> your
>>>>>> Tested-by, and will be CC' to stable so the older v4.x.y and v3.x.y
>>>>>> kernels get this fix as well.
>>>>>>
>>>>>> Thanks again.
>>>>> I'm afraid we may not be quite there yet after all...
>>>>>
>>>>> So we had the other two machines run an md check this weekend
>>>>> as well, again with a rediculously high synx_speed_max:
>>>>>
>>>>> Jul  9 04:00:01 myhost kernel: [661309.794774] md: data-check of RAID array md0
>>>>> Jul  9 04:00:01 myhost kernel: [661309.799173] md: minimum _guaranteed_  speed: 10000 KB/sec/disk.
>>>>> Jul  9 04:00:01 myhost kernel: [661309.805219] md: using maximum available idle IO bandwidth (but not more than 1000000 KB/sec) for data-check.
>>>>> Jul  9 04:00:01 myhost kernel: [661309.815194] md: using 128k window, over a total of 3252682752k.
>>>>> Jul  9 04:00:42 myhost kernel: [661351.076391] qla2xxx/21:00:00:24:ff:4b:8f:58: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION.
>>>>> Jul  9 04:02:01 myhost kernel: [661429.985082] qla2xxx/21:00:00:24:ff:4b:9e:19: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION.
>>>>> Jul  9 04:04:24 myhost kernel: [661573.395245] qla2xxx/50:01:43:80:28:ca:86:36: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION.
>>>>> Jul  9 04:04:57 myhost kernel: [661605.837694] qla2xxx/50:01:43:80:28:ca:86:e6: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION.
>>>>> Jul  9 04:09:19 myhost kernel: [661868.261211] qla2xxx/21:00:00:24:ff:54:9e:ab: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION.
>>>>> Jul  9 04:13:17 myhost kernel: [662105.788459] ABORT_TASK: Found referenced qla2xxx task_tag: 1175332
>>>>> Jul  9 04:13:17 myhost kernel: [662105.794794] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1175332
>>>>> Jul  9 04:13:17 myhost kernel: [662105.990584] ABORT_TASK: Found referenced qla2xxx task_tag: 1175380
>>>>> Jul  9 04:13:18 myhost kernel: [662106.510403] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 1175380
>>>>> Jul  9 04:13:20 myhost kernel: [662108.988526] ABORT_TASK: Found referenced qla2xxx task_tag: 1175620
>>>>> Jul  9 04:13:31 myhost kernel: [662119.684969] ABORT_TASK: Found referenced qla2xxx task_tag: 1211140
>>>> This means the outstanding I/O was located, but never completed back.
>>>>
>>>> Or, there is something else blocked waiting for completion like
>>>> outstanding qla2xxx WRITE transfer, before the abort may proceed.
>>>>
>>>>> Jul  9 04:16:42 myhost kernel: [662310.617910] qla2xxx/21:00:00:24:ff:92:bf:43: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION.
>>>>> Jul  9 04:18:00 myhost kernel: [662389.415853] qla2xxx/21:00:00:24:ff:54:a1:33: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION.
>>>>> Jul  9 04:18:22 myhost kernel: [662411.066461] qla2xxx/21:00:00:24:ff:92:bf:59: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION.
>>>>> Jul  9 04:20:23 myhost kernel: [662531.852833] qla2xxx/21:00:00:24:ff:3c:d0:94: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION.
>>>>> Jul  9 07:00:28 myhost kernel: [672137.325166] md: md0: data-check done.
>>>>>
>>>> Btw, were there any hung task warnings while this was happening..?
>>>>
>>>>> The machine in question was still responsive (it was accepting
>>>>> SSH logins); however, VMware hosts seemingly no longer saw the
>>>>> volume (presumably due to the heavy IO on the backend).
>>>>>
>>>>> Also, several (12?) kworkers seemed stuck in a D state.
>>>>>
>>>> Next time that happens, do a 'cat /proc/$PID/stack' to see where those
>>>> kworkers are blocked in un-interruptible sleep.
>>>>
>>>>> When my colleague tried to reboot the machine, it got
>>>>> (presumably) stuck on
>>>>>
>>>>> /usr/bin/targetctl clear
>>>>>
>>>>> After which it was forcefully rebooted :)
>>>>>
>>>>> Sorry we don't have any more detailed info at this point.
>>>>> We haven't been able to reproduce this on a
>>>>> different machine yet :(
>>>> Ok, please try with the following debug patch to verify if it's
>>>> blocked on qla_tgt_cmd->write_pending_abort_comp.
>>>>
>>>> diff --git a/drivers/scsi/qla2xxx/tcm_qla2xxx.c b/drivers/scsi/qla2xxx/tcm_qla2xxx.c
>>>> index 8c1bf9b..1199969 100644
>>>> --- a/drivers/scsi/qla2xxx/tcm_qla2xxx.c
>>>> +++ b/drivers/scsi/qla2xxx/tcm_qla2xxx.c
>>>> @@ -425,6 +425,7 @@ static int tcm_qla2xxx_write_pending_status(struct se_cmd *se_cmd)
>>>>      if (se_cmd->t_state == TRANSPORT_WRITE_PENDING ||
>>>>          se_cmd->t_state == TRANSPORT_COMPLETE_QF_WP) {
>>>>          spin_unlock_irqrestore(&se_cmd->t_state_lock, flags);
>>>> +        printk("Completing write_pending_abort_comp for tag: %llu\n", se_cmd->tag);
>>>>          wait_for_completion(&cmd->write_pending_abort_comp);
>>>>          return 0;
>>>>      }
>>>> @@ -515,7 +516,11 @@ static void tcm_qla2xxx_handle_data_work(struct work_struct *work)
>>>>           * waiting upon completion in tcm_qla2xxx_write_pending_status()
>>>>           */
>>>>          if (cmd->se_cmd.transport_state & CMD_T_ABORTED) {
>>>> +            u64 tag = cmd->se_cmd.tag;
>>>> +
>>>> +            printk("Blocked on write_pending_abort_comp for tag: %llu\n", tag);
>>>>              complete(&cmd->write_pending_abort_comp);
>>>> +            printk("Awoke write_pending_abort_comp for tag: %llu\n", tag);
>>>>              return;
>>>>          }
>>> Whoops, mixed up the printk() output above..
>>>
>>> Please use this one instead:
>>>
>>> diff --git a/drivers/scsi/qla2xxx/tcm_qla2xxx.c b/drivers/scsi/qla2xxx/tcm_qla2xxx.c
>>> index 8c1bf9b..01ffb8a 100644
>>> --- a/drivers/scsi/qla2xxx/tcm_qla2xxx.c
>>> +++ b/drivers/scsi/qla2xxx/tcm_qla2xxx.c
>>> @@ -417,6 +417,8 @@ static int tcm_qla2xxx_write_pending_status(struct se_cmd *se_cmd)
>>>  {
>>>      struct qla_tgt_cmd *cmd = container_of(se_cmd, struct qla_tgt_cmd, se_cmd);
>>>      unsigned long flags;
>>> +    u64 tag = se_cmd->tag;
>>> +
>>>      /*
>>>       * Check for WRITE_PENDING status to determine if we need to wait for
>>>       * CTIO aborts to be posted via hardware in tcm_qla2xxx_handle_data().
>>> @@ -425,7 +427,9 @@ static int tcm_qla2xxx_write_pending_status(struct se_cmd *se_cmd)
>>>      if (se_cmd->t_state == TRANSPORT_WRITE_PENDING ||
>>>          se_cmd->t_state == TRANSPORT_COMPLETE_QF_WP) {
>>>          spin_unlock_irqrestore(&se_cmd->t_state_lock, flags);
>>> +        printk("Blocked on write_pending_abort_comp for tag: %llu\n", tag);
>>>          wait_for_completion(&cmd->write_pending_abort_comp);
>>> +        printk("Awoke write_pending_abort_comp for tag: %llu\n", tag);
>>>          return 0;
>>>      }
>>>      spin_unlock_irqrestore(&se_cmd->t_state_lock, flags);
>>> @@ -508,6 +512,11 @@ static void tcm_qla2xxx_handle_data_work(struct work_struct *work)
>>>       */
>>>      cmd->cmd_in_wq = 0;
>>>
>>> +    if (cmd->se_cmd.transport_state & CMD_T_ABORTED) {
>>> +        printk("Detected CMD_T_ABORTED from handle_data_work tag: %llu %u\n",
>>> +            cmd->se_cmd.tag, cmd->write_data_transferred);
>>> +    }
>>> +
>>>      cmd->vha->tgt_counters.qla_core_ret_ctio++;
>>>      if (!cmd->write_data_transferred) {
>>>          /*
>>> @@ -515,6 +524,8 @@ static void tcm_qla2xxx_handle_data_work(struct work_struct *work)
>>>           * waiting upon completion in tcm_qla2xxx_write_pending_status()
>>>           */
>>>          if (cmd->se_cmd.transport_state & CMD_T_ABORTED) {
>>> +            printk("Completing write_pending_abort_comp for tag: %llu\n",
>>> +                cmd->se_cmd.tag);
>>>              complete(&cmd->write_pending_abort_comp);
>>>              return;
>>>          }
>
> We still haven't been able to reliably reproduce, however we did come
> across this, which is very likely related.
>
> Essentially we're running 'echo check > /sys/block/md0/md/sync_action'
> throughout the night, with no practical sync_speed_max limit.
>
> Only in the morning, presumably when I was doing stuff in VMware (I
> don't particularly recall what), did VMware start reporting the volume
> as inaccessible.
>
> Here's the full dmesg, to provide maximum context (sorry for the long
> post):
>
... SNIP ...
> [Fri Jul 21 10:56:23 2017] INFO: task kworker/6:2:788 blocked for more than 180 seconds.
> [Fri Jul 21 10:56:23 2017]       Not tainted 4.9.38-ul5 #3
> [Fri Jul 21 10:56:23 2017] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [Fri Jul 21 10:56:23 2017] kworker/6:2     D    0   788      2 0x00000000
> [Fri Jul 21 10:56:23 2017] Workqueue: events qlt_free_session_done [qla2xxx]
> [Fri Jul 21 10:56:23 2017]  0000000000000000 ffff9ebcd4c38000 ffff9ebcc9ce8000 ffff9ebcffc95080
> [Fri Jul 21 10:56:23 2017]  ffff9ebcdc5c9780 ffffb11c24e67b28 ffffffffa8092196 ffffffffa7ca4a37
> [Fri Jul 21 10:56:23 2017]  ffff9ebcc9ce8000 7fffffffffffffff 0000000000000000 ffff9ebcc9ce8000
> [Fri Jul 21 10:56:23 2017] Call Trace:
> [Fri Jul 21 10:56:23 2017]  [<ffffffffa8092196>] ? __schedule+0x166/0x530
> [Fri Jul 21 10:56:23 2017]  [<ffffffffa7ca4a37>] ? __enqueue_entity+0x67/0x70
> [Fri Jul 21 10:56:23 2017]  [<ffffffffa8092591>] schedule+0x31/0x80
> [Fri Jul 21 10:56:23 2017]  [<ffffffffa8094ed7>] schedule_timeout+0x147/0x190
> [Fri Jul 21 10:56:23 2017]  [<ffffffffa7ca3faf>] ? sched_clock_cpu+0x8f/0xa0
> [Fri Jul 21 10:56:23 2017]  [<ffffffffa7c9f06f>] ? check_preempt_curr+0x4f/0x80
> [Fri Jul 21 10:56:23 2017]  [<ffffffffa8092f10>] wait_for_common+0xb0/0x170
> [Fri Jul 21 10:56:23 2017]  [<ffffffffa7c9fdb0>] ? wake_up_q+0x70/0x70
> [Fri Jul 21 10:56:23 2017]  [<ffffffffa8092fe8>] wait_for_completion+0x18/0x20
> [Fri Jul 21 10:56:23 2017]  [<ffffffffc02b3901>] target_wait_for_sess_cmds+0x41/0x110 [target_core_mod]
> [Fri Jul 21 10:56:23 2017]  [<ffffffffc033afb0>] tcm_qla2xxx_free_session+0x40/0x80 [tcm_qla2xxx]
> [Fri Jul 21 10:56:23 2017]  [<ffffffffc048d44d>] qlt_free_session_done+0x2cd/0x3d0 [qla2xxx]
> [Fri Jul 21 10:56:23 2017]  [<ffffffffa7c92323>] ? insert_work+0x53/0xc0
> [Fri Jul 21 10:56:23 2017]  [<ffffffffa7c93111>] process_one_work+0x121/0x3c0
> [Fri Jul 21 10:56:23 2017]  [<ffffffffa7c933f3>] worker_thread+0x43/0x4d0
> [Fri Jul 21 10:56:23 2017]  [<ffffffffa7c933b0>] ? process_one_work+0x3c0/0x3c0
> [Fri Jul 21 10:56:23 2017]  [<ffffffffa7c933b0>] ? process_one_work+0x3c0/0x3c0
> [Fri Jul 21 10:56:23 2017]  [<ffffffffa7c98be4>] kthread+0xd4/0xf0
> [Fri Jul 21 10:56:23 2017]  [<ffffffffa7c98b10>] ? kthread_worker_fn+0x100/0x100
> [Fri Jul 21 10:56:23 2017]  [<ffffffffa8096022>] ret_from_fork+0x22/0x30
>
> I hope this helps, let me know if you need anything from our end.

I forgot to clarify: this was with 4.9.38, with the following patches
applied:

1eb42f965cedafb700e9c902ddafb1c51e3117f7.patch
4db6a8145940d0bbd10265020d681961ce2d3238.patch
linux-4.9.38-target-filter-0x85-scsi-opcde-warning.patch (see below)
nab-debug.patch (the patch you asked me to apply)

Regards,
Pascal de Bruijn


diff -Nurpd a/drivers/target/target_core_transport.c b/drivers/target/target_core_transport.c
--- a/drivers/target/target_core_transport.c    2017-07-15 12:17:55.000000000 +0200
+++ b/drivers/target/target_core_transport.c    2017-07-20 09:51:21.494390926 +0200
@@ -1337,7 +1337,7 @@ target_setup_cmd_from_cdb(struct se_cmd
     trace_target_sequencer_start(cmd);
 
     ret = dev->transport->parse_cdb(cmd);
-    if (ret == TCM_UNSUPPORTED_SCSI_OPCODE)
+    if (ret == TCM_UNSUPPORTED_SCSI_OPCODE && cmd->t_task_cdb[0] != 0x85) /* filter VMware ATA PASS THROUGH for SMART */
         pr_warn_ratelimited("%s/%s: Unsupported SCSI Opcode 0x%02x, sending CHECK_CONDITION.\n",
                     cmd->se_tfo->get_fabric_name(),
                     cmd->se_sess->se_node_acl->initiatorname,
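
For context: opcode 0x85 is ATA PASS-THROUGH(16), which ESXi issues when
polling SMART status, so the hunk above merely silences the ratelimited
warning; the initiator still receives CHECK_CONDITION. If more opcodes ever
needed filtering, a table-driven variant could look like the following
(purely illustrative; quiet_opcodes and opcode_warn_filtered do not exist
in the tree):

/* Hypothetical helper -- not upstream code. */
static const u8 quiet_opcodes[] = {
	0x85,	/* ATA PASS-THROUGH(16), used by VMware SMART polling */
	0xa1,	/* ATA PASS-THROUGH(12) */
};

static bool opcode_warn_filtered(u8 opcode)
{
	int i;

	for (i = 0; i < ARRAY_SIZE(quiet_opcodes); i++)
		if (quiet_opcodes[i] == opcode)
			return true;
	return false;
}

with the warning condition becoming:

	if (ret == TCM_UNSUPPORTED_SCSI_OPCODE &&
	    !opcode_warn_filtered(cmd->t_task_cdb[0]))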


diff mbox

Patch

diff --git a/drivers/scsi/qla2xxx/tcm_qla2xxx.c b/drivers/scsi/qla2xxx/tcm_qla2xxx.c
index 8c1bf9b..01ffb8a 100644
--- a/drivers/scsi/qla2xxx/tcm_qla2xxx.c
+++ b/drivers/scsi/qla2xxx/tcm_qla2xxx.c
@@ -417,6 +417,8 @@  static int tcm_qla2xxx_write_pending_status(struct se_cmd *se_cmd)
 {
 	struct qla_tgt_cmd *cmd = container_of(se_cmd, struct qla_tgt_cmd, se_cmd);
 	unsigned long flags;
+	u64 tag = se_cmd->tag;
+
 	/*
 	 * Check for WRITE_PENDING status to determine if we need to wait for
 	 * CTIO aborts to be posted via hardware in tcm_qla2xxx_handle_data().
@@ -425,7 +427,9 @@  static int tcm_qla2xxx_write_pending_status(struct se_cmd *se_cmd)
 	if (se_cmd->t_state == TRANSPORT_WRITE_PENDING ||
 	    se_cmd->t_state == TRANSPORT_COMPLETE_QF_WP) {
 		spin_unlock_irqrestore(&se_cmd->t_state_lock, flags);
+		printk("Blocked on write_pending_abort_comp for tag: %llu\n", tag);
 		wait_for_completion(&cmd->write_pending_abort_comp);
+		printk("Awoke write_pending_abort_comp for tag: %llu\n", tag);
 		return 0;
 	}
 	spin_unlock_irqrestore(&se_cmd->t_state_lock, flags);
@@ -508,6 +512,11 @@  static void tcm_qla2xxx_handle_data_work(struct work_struct *work)
 	 */
 	cmd->cmd_in_wq = 0;
 
+	if (cmd->se_cmd.transport_state & CMD_T_ABORTED) {
+		printk("Detected CMD_T_ABORTED from handle_data_work tag: %llu %u\n",
+			cmd->se_cmd.tag, cmd->write_data_transferred);
+	}
+
 	cmd->vha->tgt_counters.qla_core_ret_ctio++;
 	if (!cmd->write_data_transferred) {
 		/*
@@ -515,6 +524,8 @@  static void tcm_qla2xxx_handle_data_work(struct work_struct *work)
 		 * waiting upon completion in tcm_qla2xxx_write_pending_status()
 		 */
 		if (cmd->se_cmd.transport_state & CMD_T_ABORTED) {
+			printk("Completing write_pending_abort_comp for tag: %llu\n",
+				cmd->se_cmd.tag);
 			complete(&cmd->write_pending_abort_comp);
 			return;
 		}
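
Condensed, the pairing these printk()s bracket looks like this (a
simplified sketch of the flow under test, not the literal driver code;
the sketch_ names are illustrative):

/* Side A: process context, the write_pending_status path. */
static int sketch_write_pending_status(struct qla_tgt_cmd *cmd)
{
	struct se_cmd *se_cmd = &cmd->se_cmd;

	if (se_cmd->t_state == TRANSPORT_WRITE_PENDING ||
	    se_cmd->t_state == TRANSPORT_COMPLETE_QF_WP)
		/* Blocks until side B fires. */
		wait_for_completion(&cmd->write_pending_abort_comp);
	return 0;
}

/* Side B: runs from the CTIO completion workqueue. */
static void sketch_handle_data_work(struct qla_tgt_cmd *cmd)
{
	if (!cmd->write_data_transferred &&
	    (cmd->se_cmd.transport_state & CMD_T_ABORTED))
		complete(&cmd->write_pending_abort_comp);	/* wakes side A */
}

So a "Blocked on write_pending_abort_comp" message with no matching
"Awoke" line for the same tag would confirm side B never ran, i.e. the
CTIO for the aborted WRITE was never posted back by the hardware.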