Message ID | 1499745137.28145.15.camel@haakon3.daterainc.com (mailing list archive)
---|---
State | New, archived
Hi Pascal,

On Mon, 2017-07-10 at 20:52 -0700, Nicholas A. Bellinger wrote:
> On Mon, 2017-07-10 at 20:41 -0700, Nicholas A. Bellinger wrote:
> > On Mon, 2017-07-10 at 17:04 +0200, Pascal de Bruijn wrote:
> > > On 07-07-17 06:26, Nicholas A. Bellinger wrote:
> > > > On Mon, 2017-07-03 at 16:03 +0200, Pascal de Bruijn wrote:
> > > >> So abort_task can still be observed, but they do not result in a
> > > >> non-functional not-quite-PANICked machine anymore.
> > > > Thanks a lot for the bug report and your continuous testing to get this
> > > > resolved. The patch is queued up in target-pending/for-next with your
> > > > Tested-by, and will be CC'd to stable so the older v4.x.y and v3.x.y
> > > > kernels get this fix as well.
> > > >
> > > > Thanks again.
> > >
> > > I'm afraid we may not be quite there yet after all...
> > >
> > > So we had the other two machines run an md check this weekend
> > > as well, again with a ridiculously high sync_speed_max:
> > >
> > > Jul 9 04:00:01 myhost kernel: [661309.794774] md: data-check of RAID array md0
> > > Jul 9 04:00:01 myhost kernel: [661309.799173] md: minimum _guaranteed_ speed: 10000 KB/sec/disk.
> > > Jul 9 04:00:01 myhost kernel: [661309.805219] md: using maximum available idle IO bandwidth (but not more than 1000000 KB/sec) for data-check.
> > > Jul 9 04:00:01 myhost kernel: [661309.815194] md: using 128k window, over a total of 3252682752k.
> > > Jul 9 04:00:42 myhost kernel: [661351.076391] qla2xxx/21:00:00:24:ff:4b:8f:58: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION.
> > > Jul 9 04:02:01 myhost kernel: [661429.985082] qla2xxx/21:00:00:24:ff:4b:9e:19: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION.
> > > Jul 9 04:04:24 myhost kernel: [661573.395245] qla2xxx/50:01:43:80:28:ca:86:36: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION.
> > > Jul 9 04:04:57 myhost kernel: [661605.837694] qla2xxx/50:01:43:80:28:ca:86:e6: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION.
> > > Jul 9 04:09:19 myhost kernel: [661868.261211] qla2xxx/21:00:00:24:ff:54:9e:ab: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION.
> > > Jul 9 04:13:17 myhost kernel: [662105.788459] ABORT_TASK: Found referenced qla2xxx task_tag: 1175332
> > > Jul 9 04:13:17 myhost kernel: [662105.794794] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1175332
> > > Jul 9 04:13:17 myhost kernel: [662105.990584] ABORT_TASK: Found referenced qla2xxx task_tag: 1175380
> > > Jul 9 04:13:18 myhost kernel: [662106.510403] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 1175380
> > > Jul 9 04:13:20 myhost kernel: [662108.988526] ABORT_TASK: Found referenced qla2xxx task_tag: 1175620
> > > Jul 9 04:13:31 myhost kernel: [662119.684969] ABORT_TASK: Found referenced qla2xxx task_tag: 1211140
> >
> > This means the outstanding I/O was located, but never completed back.
> >
> > Or, there is something else blocked waiting for completion, like an
> > outstanding qla2xxx WRITE transfer, before the abort may proceed.
> >
> > > Jul 9 04:16:42 myhost kernel: [662310.617910] qla2xxx/21:00:00:24:ff:92:bf:43: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION.
> > > Jul 9 04:18:00 myhost kernel: [662389.415853] qla2xxx/21:00:00:24:ff:54:a1:33: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION.
> > > Jul 9 04:18:22 myhost kernel: [662411.066461] qla2xxx/21:00:00:24:ff:92:bf:59: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION.
> > > Jul 9 04:20:23 myhost kernel: [662531.852833] qla2xxx/21:00:00:24:ff:3c:d0:94: Unsupported SCSI Opcode 0x85, sending CHECK_CONDITION.
> > > Jul 9 07:00:28 myhost kernel: [672137.325166] md: md0: data-check done.
> >
> > Btw, were there any hung task warnings while this was happening..?
> >
> > > The machine in question was still responsive (was accepting
> > > SSH logins), however it seemed VMware hosts weren't seeing
> > > the volume anymore (presumably due to the heavy IO on the backend).
> > >
> > > Also, several (12?) kworkers seemed stuck in a D state.
> >
> > Next time that happens, do a 'cat /proc/$PID/stack' to see where those
> > kworkers are blocked in uninterruptible sleep.
> >
> > > When my colleague tried to reboot the machine it got
> > > (presumably) stuck on
> > >
> > > /usr/bin/targetctl clear
> > >
> > > After which it was forcefully rebooted :)
> > >
> > > Sorry we don't have any more detailed info at this point.
> > > We haven't been able to reproduce this on a
> > > different machine yet :(
> >
> > Ok, please try with the following debug patch to verify if it's blocked
> > on qla_tgt_cmd->write_pending_abort_comp.
> >
> > diff --git a/drivers/scsi/qla2xxx/tcm_qla2xxx.c b/drivers/scsi/qla2xxx/tcm_qla2xxx.c
> > index 8c1bf9b..1199969 100644
> > --- a/drivers/scsi/qla2xxx/tcm_qla2xxx.c
> > +++ b/drivers/scsi/qla2xxx/tcm_qla2xxx.c
> > @@ -425,6 +425,7 @@ static int tcm_qla2xxx_write_pending_status(struct se_cmd *se_cmd)
> >  	if (se_cmd->t_state == TRANSPORT_WRITE_PENDING ||
> >  	    se_cmd->t_state == TRANSPORT_COMPLETE_QF_WP) {
> >  		spin_unlock_irqrestore(&se_cmd->t_state_lock, flags);
> > +		printk("Completing write_pending_abort_comp for tag: %llu\n", se_cmd->tag);
> >  		wait_for_completion(&cmd->write_pending_abort_comp);
> >  		return 0;
> >  	}
> > @@ -515,7 +516,11 @@ static void tcm_qla2xxx_handle_data_work(struct work_struct *work)
> >  	 * waiting upon completion in tcm_qla2xxx_write_pending_status()
> >  	 */
> >  	if (cmd->se_cmd.transport_state & CMD_T_ABORTED) {
> > +		u64 tag = cmd->se_cmd.tag;
> > +
> > +		printk("Blocked on write_pending_abort_comp for tag: %llu\n", tag);
> >  		complete(&cmd->write_pending_abort_comp);
> > +		printk("Awoke write_pending_abort_comp for tag: %llu\n", tag);
> >  		return;
> >  	}
>
> Whoops, mixed up the printk() output above..
>
> Please use this one instead:
>
> diff --git a/drivers/scsi/qla2xxx/tcm_qla2xxx.c b/drivers/scsi/qla2xxx/tcm_qla2xxx.c
> index 8c1bf9b..01ffb8a 100644
> --- a/drivers/scsi/qla2xxx/tcm_qla2xxx.c
> +++ b/drivers/scsi/qla2xxx/tcm_qla2xxx.c
> @@ -417,6 +417,8 @@ static int tcm_qla2xxx_write_pending_status(struct se_cmd *se_cmd)
>  {
>  	struct qla_tgt_cmd *cmd = container_of(se_cmd, struct qla_tgt_cmd, se_cmd);
>  	unsigned long flags;
> +	u64 tag = se_cmd->tag;
> +
>  	/*
>  	 * Check for WRITE_PENDING status to determine if we need to wait for
>  	 * CTIO aborts to be posted via hardware in tcm_qla2xxx_handle_data().
> @@ -425,7 +427,9 @@ static int tcm_qla2xxx_write_pending_status(struct se_cmd *se_cmd)
>  	if (se_cmd->t_state == TRANSPORT_WRITE_PENDING ||
>  	    se_cmd->t_state == TRANSPORT_COMPLETE_QF_WP) {
>  		spin_unlock_irqrestore(&se_cmd->t_state_lock, flags);
> +		printk("Blocked on write_pending_abort_comp for tag: %llu\n", tag);
>  		wait_for_completion(&cmd->write_pending_abort_comp);
> +		printk("Awoke write_pending_abort_comp for tag: %llu\n", tag);
>  		return 0;
>  	}
>  	spin_unlock_irqrestore(&se_cmd->t_state_lock, flags);
> @@ -508,6 +512,11 @@ static void tcm_qla2xxx_handle_data_work(struct work_struct *work)
>  	 */
>  	cmd->cmd_in_wq = 0;
>
> +	if (cmd->se_cmd.transport_state & CMD_T_ABORTED) {
> +		printk("Detected CMD_T_ABORTED from handle_data_work tag: %llu %u\n",
> +			cmd->se_cmd.tag, cmd->write_data_transferred);
> +	}
> +
>  	cmd->vha->tgt_counters.qla_core_ret_ctio++;
>  	if (!cmd->write_data_transferred) {
>  		/*
> @@ -515,6 +524,8 @@ static void tcm_qla2xxx_handle_data_work(struct work_struct *work)
>  	 * waiting upon completion in tcm_qla2xxx_write_pending_status()
>  	 */
>  	if (cmd->se_cmd.transport_state & CMD_T_ABORTED) {
> +		printk("Completing write_pending_abort_comp for tag: %llu\n",
> +			cmd->se_cmd.tag);
>  		complete(&cmd->write_pending_abort_comp);
>  		return;
>  	}

Btw, since you're still observing issues with the original change, it has
been reverted in target-pending/for-next for now until we get the second
issue sorted out. I'll plan to merge an updated version post-v4.13-rc1 once
we get a handle on what's going on.

Thanks again.
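For readers skimming the archive: the debug patch above instruments a
two-sided completion handshake. A distilled sketch of just that handshake
(simplified from the quoted diffs; the *_sketch names are illustrative, and
the locking and t_state checks are omitted):

#include <linux/completion.h>

/* Abort path: blocks in uninterruptible sleep (D state) until the CTIO
 * abort for the outstanding WRITE is posted back by the hardware. */
static int write_pending_status_sketch(struct qla_tgt_cmd *cmd)
{
	wait_for_completion(&cmd->write_pending_abort_comp);
	return 0;
}

/* Data-in work: the only path that fires that completion, and only when
 * the command was aborted before its WRITE data transferred. */
static void handle_data_work_sketch(struct qla_tgt_cmd *cmd)
{
	if (!cmd->write_data_transferred &&
	    (cmd->se_cmd.transport_state & CMD_T_ABORTED)) {
		complete(&cmd->write_pending_abort_comp);
		return;
	}
	/* ... otherwise, normal completion of the WRITE ... */
}

The hang being chased in this thread is the case where the waiter runs but
the completer never does; the printk() pairs in the debug patch exist to
show which side was reached.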
On 11-07-17 19:52, Nicholas A. Bellinger wrote:
> Hi Pascal,
>
> ... SNIP (debug patches quoted in full above) ...

We still haven't been able to reliably reproduce, however we did come
across this, which is very likely related.

Essentially we're running 'echo check > /sys/block/md0/md/sync_action'
throughout the night, with no practical sync_speed_max limit.

Only in the morning, presumably when I was doing stuff in VMware (I don't
particularly recall what), VMware started reporting the volume as
inaccessible.

Here's the full dmesg, to provide maximum context (sorry for the long post):

[Thu Jul 20 18:01:51 2017] Rounding down aligned max_sectors from 4294967295 to 4294967288
[Thu Jul 20 18:01:51 2017] Rounding down aligned max_sectors from 4294967295 to 4294967288
[Thu Jul 20 18:01:51 2017] ignoring deprecated emulate_dpo attribute
[Thu Jul 20 18:01:51 2017] ignoring deprecated emulate_fua_read attribute
[Thu Jul 20 18:01:51 2017] qla2xxx [0000:81:00.1]-00af:13: Performing ISP error recovery - ha=ffff9ebcd33c0000.
[Thu Jul 20 18:01:53 2017] qla2xxx [0000:81:00.1]-500a:13: LOOP UP detected (4 Gbps).
[Thu Jul 20 18:01:53 2017] qla2xxx [0000:82:00.0]-00af:14: Performing ISP error recovery - ha=ffff9ebcd3180000.
[Thu Jul 20 18:01:55 2017] qla2xxx [0000:82:00.0]-500a:14: LOOP UP detected (4 Gbps).
[Thu Jul 20 18:01:56 2017] qla2xxx [0000:82:00.1]-00af:15: Performing ISP error recovery - ha=ffff9ebcd3400000.
[Thu Jul 20 18:01:57 2017] qla2xxx [0000:82:00.1]-500a:15: LOOP UP detected (4 Gbps).
[Thu Jul 20 18:01:58 2017] qla2xxx [0000:81:00.0]-00af:2: Performing ISP error recovery - ha=ffff9ebcd4190000.
[Thu Jul 20 18:01:59 2017] qla2xxx [0000:81:00.0]-500a:2: LOOP UP detected (4 Gbps).
[Thu Jul 20 18:04:42 2017] md: data-check of RAID array md0
[Thu Jul 20 18:04:42 2017] md: minimum _guaranteed_ speed: 15000 KB/sec/disk.
[Thu Jul 20 18:04:42 2017] md: using maximum available idle IO bandwidth (but not more than 1000000 KB/sec) for data-check.
[Thu Jul 20 18:04:42 2017] md: using 128k window, over a total of 405798912k.
[Thu Jul 20 18:40:14 2017] md: md0: data-check done.
[Thu Jul 20 18:56:49 2017] md: data-check of RAID array md0
[Thu Jul 20 18:56:49 2017] md: minimum _guaranteed_ speed: 15000 KB/sec/disk.
[Thu Jul 20 18:56:49 2017] md: using maximum available idle IO bandwidth (but not more than 1000000 KB/sec) for data-check.
[Thu Jul 20 18:56:49 2017] md: using 128k window, over a total of 405798912k.
[Thu Jul 20 19:34:16 2017] md: md0: data-check done.
[Thu Jul 20 20:03:49 2017] md: data-check of RAID array md0
[Thu Jul 20 20:03:49 2017] md: minimum _guaranteed_ speed: 15000 KB/sec/disk.
[Thu Jul 20 20:03:49 2017] md: using maximum available idle IO bandwidth (but not more than 1000000 KB/sec) for data-check.
[Thu Jul 20 20:03:49 2017] md: using 128k window, over a total of 405798912k.
[Thu Jul 20 20:04:51 2017] ABORT_TASK: Found referenced qla2xxx task_tag: 1223620
[Thu Jul 20 20:04:52 2017] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 1223620
[Thu Jul 20 20:08:00 2017] ABORT_TASK: Found referenced qla2xxx task_tag: 1237348
[Thu Jul 20 20:08:01 2017] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 1237348
[Thu Jul 20 20:40:04 2017] md: md0: data-check done.
[Thu Jul 20 21:10:49 2017] md: data-check of RAID array md0
[Thu Jul 20 21:10:49 2017] md: minimum _guaranteed_ speed: 15000 KB/sec/disk.
[Thu Jul 20 21:10:49 2017] md: using maximum available idle IO bandwidth (but not more than 1000000 KB/sec) for data-check.
[Thu Jul 20 21:10:49 2017] md: using 128k window, over a total of 405798912k.
[Thu Jul 20 21:51:29 2017] md: md0: data-check done.
[Thu Jul 20 22:17:49 2017] md: data-check of RAID array md0
[Thu Jul 20 22:17:49 2017] md: minimum _guaranteed_ speed: 15000 KB/sec/disk.
[Thu Jul 20 22:17:49 2017] md: using maximum available idle IO bandwidth (but not more than 1000000 KB/sec) for data-check.
[Thu Jul 20 22:17:49 2017] md: using 128k window, over a total of 405798912k.
[Thu Jul 20 22:53:37 2017] md: md0: data-check done.
[Thu Jul 20 23:24:50 2017] md: data-check of RAID array md0
[Thu Jul 20 23:24:50 2017] md: minimum _guaranteed_ speed: 15000 KB/sec/disk.
[Thu Jul 20 23:24:50 2017] md: using maximum available idle IO bandwidth (but not more than 1000000 KB/sec) for data-check.
[Thu Jul 20 23:24:50 2017] md: using 128k window, over a total of 405798912k.
[Fri Jul 21 00:04:39 2017] md: md0: data-check done.
[Fri Jul 21 00:31:50 2017] md: data-check of RAID array md0
[Fri Jul 21 00:31:50 2017] md: minimum _guaranteed_ speed: 15000 KB/sec/disk.
[Fri Jul 21 00:31:50 2017] md: using maximum available idle IO bandwidth (but not more than 1000000 KB/sec) for data-check.
[Fri Jul 21 00:31:50 2017] md: using 128k window, over a total of 405798912k.
[Fri Jul 21 01:09:07 2017] md: md0: data-check done.
[Fri Jul 21 01:38:50 2017] md: data-check of RAID array md0
[Fri Jul 21 01:38:50 2017] md: minimum _guaranteed_ speed: 15000 KB/sec/disk.
[Fri Jul 21 01:38:50 2017] md: using maximum available idle IO bandwidth (but not more than 1000000 KB/sec) for data-check.
[Fri Jul 21 01:38:50 2017] md: using 128k window, over a total of 405798912k.
[Fri Jul 21 01:39:11 2017] ABORT_TASK: Found referenced qla2xxx task_tag: 1161652
[Fri Jul 21 01:39:12 2017] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 1161652
[Fri Jul 21 02:05:54 2017] ABORT_TASK: Found referenced qla2xxx task_tag: 1213252
[Fri Jul 21 02:05:55 2017] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 1213252
[Fri Jul 21 02:20:30 2017] md: md0: data-check done.
[Fri Jul 21 02:45:50 2017] md: data-check of RAID array md0
[Fri Jul 21 02:45:50 2017] md: minimum _guaranteed_ speed: 15000 KB/sec/disk.
[Fri Jul 21 02:45:50 2017] md: using maximum available idle IO bandwidth (but not more than 1000000 KB/sec) for data-check.
[Fri Jul 21 02:45:50 2017] md: using 128k window, over a total of 405798912k.
[Fri Jul 21 03:08:14 2017] ABORT_TASK: Found referenced qla2xxx task_tag: 1243780
[Fri Jul 21 03:08:15 2017] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 1243780
[Fri Jul 21 03:28:21 2017] md: md0: data-check done.
[Fri Jul 21 03:52:50 2017] md: data-check of RAID array md0
[Fri Jul 21 03:52:50 2017] md: minimum _guaranteed_ speed: 15000 KB/sec/disk.
[Fri Jul 21 03:52:50 2017] md: using maximum available idle IO bandwidth (but not more than 1000000 KB/sec) for data-check.
[Fri Jul 21 03:52:50 2017] md: using 128k window, over a total of 405798912k.
[Fri Jul 21 03:58:34 2017] ABORT_TASK: Found referenced qla2xxx task_tag: 1162612
[Fri Jul 21 03:58:34 2017] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 1162612
[Fri Jul 21 04:33:10 2017] md: md0: data-check done.
[Fri Jul 21 04:59:50 2017] md: data-check of RAID array md0
[Fri Jul 21 04:59:50 2017] md: minimum _guaranteed_ speed: 15000 KB/sec/disk.
[Fri Jul 21 04:59:50 2017] md: using maximum available idle IO bandwidth (but not more than 1000000 KB/sec) for data-check.
[Fri Jul 21 04:59:50 2017] md: using 128k window, over a total of 405798912k.
[Fri Jul 21 05:40:34 2017] md: md0: data-check done.
[Fri Jul 21 06:06:50 2017] md: data-check of RAID array md0
[Fri Jul 21 06:06:50 2017] md: minimum _guaranteed_ speed: 15000 KB/sec/disk.
[Fri Jul 21 06:06:50 2017] md: using maximum available idle IO bandwidth (but not more than 1000000 KB/sec) for data-check.
[Fri Jul 21 06:06:50 2017] md: using 128k window, over a total of 405798912k.
[Fri Jul 21 06:46:58 2017] md: md0: data-check done.
[Fri Jul 21 07:13:50 2017] md: data-check of RAID array md0
[Fri Jul 21 07:13:50 2017] md: minimum _guaranteed_ speed: 15000 KB/sec/disk.
[Fri Jul 21 07:13:50 2017] md: using maximum available idle IO bandwidth (but not more than 1000000 KB/sec) for data-check.
[Fri Jul 21 07:13:50 2017] md: using 128k window, over a total of 405798912k.
[Fri Jul 21 07:59:38 2017] md: md0: data-check done.
[Fri Jul 21 08:20:50 2017] md: data-check of RAID array md0
[Fri Jul 21 08:20:50 2017] md: minimum _guaranteed_ speed: 15000 KB/sec/disk.
[Fri Jul 21 08:20:50 2017] md: using maximum available idle IO bandwidth (but not more than 1000000 KB/sec) for data-check.
[Fri Jul 21 08:20:50 2017] md: using 128k window, over a total of 405798912k.
[Fri Jul 21 09:04:12 2017] md: md0: data-check done.
[Fri Jul 21 09:27:50 2017] md: data-check of RAID array md0
[Fri Jul 21 09:27:50 2017] md: minimum _guaranteed_ speed: 15000 KB/sec/disk.
[Fri Jul 21 09:27:50 2017] md: using maximum available idle IO bandwidth (but not more than 1000000 KB/sec) for data-check.
[Fri Jul 21 09:27:50 2017] md: using 128k window, over a total of 405798912k.
[Fri Jul 21 10:09:28 2017] md: md0: data-check done.
[Fri Jul 21 10:34:50 2017] md: data-check of RAID array md0
[Fri Jul 21 10:34:50 2017] md: minimum _guaranteed_ speed: 15000 KB/sec/disk.
[Fri Jul 21 10:34:50 2017] md: using maximum available idle IO bandwidth (but not more than 1000000 KB/sec) for data-check.
[Fri Jul 21 10:34:50 2017] md: using 128k window, over a total of 405798912k.
[Fri Jul 21 10:46:22 2017] ABORT_TASK: Found referenced qla2xxx task_tag: 1247284
[Fri Jul 21 10:46:24 2017] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 1247284
[Fri Jul 21 10:46:24 2017] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1247236
[Fri Jul 21 10:46:34 2017] ABORT_TASK: Found referenced qla2xxx task_tag: 1248148
[Fri Jul 21 10:49:04 2017] ABORT_TASK: Found referenced qla2xxx task_tag: 1175812
[Fri Jul 21 10:50:22 2017] INFO: task kworker/u34:2:4840 blocked for more than 180 seconds.
[Fri Jul 21 10:50:22 2017] Not tainted 4.9.38-ul5 #3
[Fri Jul 21 10:50:22 2017] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Fri Jul 21 10:50:22 2017] kworker/u34:2 D 0 4840 2 0x00000000
[Fri Jul 21 10:50:22 2017] Workqueue: tmr-iblock target_tmr_work [target_core_mod]
[Fri Jul 21 10:50:22 2017] 0000000000000000 ffff9eb4d5331880 ffff9ebccaccd240 ffff9ebcffcd5080
[Fri Jul 21 10:50:22 2017] ffff9ebcdc5ca340 ffffb11c07d0fbe8 ffffffffa8092196 0000000000130b94
[Fri Jul 21 10:50:22 2017] ffff9ebccaccd240 7fffffffffffffff 0000000000000000 ffff9ebccaccd240
[Fri Jul 21 10:50:22 2017] Call Trace:
[Fri Jul 21 10:50:22 2017] [<ffffffffa8092196>] ? __schedule+0x166/0x530
[Fri Jul 21 10:50:22 2017] [<ffffffffa8092591>] schedule+0x31/0x80
[Fri Jul 21 10:50:22 2017] [<ffffffffa8094ed7>] schedule_timeout+0x147/0x190
[Fri Jul 21 10:50:22 2017] [<ffffffffa7cb5edd>] ? up+0x2d/0x50
[Fri Jul 21 10:50:22 2017] [<ffffffffa7ce7b85>] ? irq_work_queue+0x85/0x90
[Fri Jul 21 10:50:22 2017] [<ffffffffa8092f10>] wait_for_common+0xb0/0x170
[Fri Jul 21 10:50:22 2017] [<ffffffffa7c9fdb0>] ? wake_up_q+0x70/0x70
[Fri Jul 21 10:50:22 2017] [<ffffffffa8092fe8>] wait_for_completion+0x18/0x20
[Fri Jul 21 10:50:22 2017] [<ffffffffc02b0d9c>] __transport_wait_for_tasks+0xac/0x110 [target_core_mod]
[Fri Jul 21 10:50:22 2017] [<ffffffffc02b0ece>] transport_wait_for_tasks+0x4e/0x80 [target_core_mod]
[Fri Jul 21 10:50:22 2017] [<ffffffffc02aea79>] core_tmr_abort_task+0x119/0x190 [target_core_mod]
[Fri Jul 21 10:50:22 2017] [<ffffffffc02b1011>] target_tmr_work+0x111/0x120 [target_core_mod]
[Fri Jul 21 10:50:22 2017] [<ffffffffa7c93111>] process_one_work+0x121/0x3c0
[Fri Jul 21 10:50:22 2017] [<ffffffffa7c933f3>] worker_thread+0x43/0x4d0
[Fri Jul 21 10:50:22 2017] [<ffffffffa7c933b0>] ? process_one_work+0x3c0/0x3c0
[Fri Jul 21 10:50:22 2017] [<ffffffffa7c933b0>] ? process_one_work+0x3c0/0x3c0
[Fri Jul 21 10:50:22 2017] [<ffffffffa7c98be4>] kthread+0xd4/0xf0
[Fri Jul 21 10:50:22 2017] [<ffffffffa7c98b10>] ? kthread_worker_fn+0x100/0x100
[Fri Jul 21 10:50:22 2017] [<ffffffffa8096022>] ret_from_fork+0x22/0x30
[Fri Jul 21 10:53:23 2017] INFO: task kworker/13:1:3063 blocked for more than 180 seconds.
[Fri Jul 21 10:53:23 2017] Not tainted 4.9.38-ul5 #3
[Fri Jul 21 10:53:23 2017] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Fri Jul 21 10:53:23 2017] kworker/13:1 D 0 3063 2 0x00000000
[Fri Jul 21 10:53:23 2017] Workqueue: events qlt_free_session_done [qla2xxx]
[Fri Jul 21 10:53:23 2017] 0000000000000000 ffff9ebcd4c38000 ffff9ebccd9c2340 ffff9ebcffd55080
[Fri Jul 21 10:53:23 2017] ffff9ebcdc5cde00 ffffb11c2736fb28 ffffffffa8092196 ffffffffa7ca4a37
[Fri Jul 21 10:53:23 2017] ffff9ebccd9c2340 7fffffffffffffff 0000000000000000 ffff9ebccd9c2340
[Fri Jul 21 10:53:23 2017] Call Trace:
[Fri Jul 21 10:53:23 2017] [<ffffffffa8092196>] ? __schedule+0x166/0x530
[Fri Jul 21 10:53:23 2017] [<ffffffffa7ca4a37>] ? __enqueue_entity+0x67/0x70
[Fri Jul 21 10:53:23 2017] [<ffffffffa8092591>] schedule+0x31/0x80
[Fri Jul 21 10:53:23 2017] [<ffffffffa8094ed7>] schedule_timeout+0x147/0x190
[Fri Jul 21 10:53:23 2017] [<ffffffffa7ca3faf>] ? sched_clock_cpu+0x8f/0xa0
[Fri Jul 21 10:53:23 2017] [<ffffffffa7c9f06f>] ? check_preempt_curr+0x4f/0x80
[Fri Jul 21 10:53:23 2017] [<ffffffffa8092f10>] wait_for_common+0xb0/0x170
[Fri Jul 21 10:53:23 2017] [<ffffffffa7c9fdb0>] ? wake_up_q+0x70/0x70
[Fri Jul 21 10:53:23 2017] [<ffffffffa8092fe8>] wait_for_completion+0x18/0x20
[Fri Jul 21 10:53:23 2017] [<ffffffffc02b3901>] target_wait_for_sess_cmds+0x41/0x110 [target_core_mod]
[Fri Jul 21 10:53:23 2017] [<ffffffffc033afb0>] tcm_qla2xxx_free_session+0x40/0x80 [tcm_qla2xxx]
[Fri Jul 21 10:53:23 2017] [<ffffffffc048d44d>] qlt_free_session_done+0x2cd/0x3d0 [qla2xxx]
[Fri Jul 21 10:53:23 2017] [<ffffffffa7c92323>] ? insert_work+0x53/0xc0
[Fri Jul 21 10:53:23 2017] [<ffffffffa7c93111>] process_one_work+0x121/0x3c0
[Fri Jul 21 10:53:23 2017] [<ffffffffa7c933f3>] worker_thread+0x43/0x4d0
[Fri Jul 21 10:53:23 2017] [<ffffffffa7c933b0>] ? process_one_work+0x3c0/0x3c0
[Fri Jul 21 10:53:23 2017] [<ffffffffa7c933b0>] ? process_one_work+0x3c0/0x3c0
[Fri Jul 21 10:53:23 2017] [<ffffffffa7c98be4>] kthread+0xd4/0xf0
[Fri Jul 21 10:53:23 2017] [<ffffffffa7c98b10>] ? kthread_worker_fn+0x100/0x100
[Fri Jul 21 10:53:23 2017] [<ffffffffa8096022>] ret_from_fork+0x22/0x30
[Fri Jul 21 10:53:23 2017] INFO: task kworker/2:1:20395 blocked for more than 180 seconds.
[Fri Jul 21 10:53:23 2017] Not tainted 4.9.38-ul5 #3
[Fri Jul 21 10:53:23 2017] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Fri Jul 21 10:53:23 2017] kworker/2:1 D 0 20395 2 0x00000000
[Fri Jul 21 10:53:23 2017] Workqueue: events qlt_free_session_done [qla2xxx]
[Fri Jul 21 10:53:23 2017] 0000000000000000 ffff9ebcd4c38000 ffff9eb468940bc0 ffff9eb4dfc95080
[Fri Jul 21 10:53:23 2017] ffff9eb4dc689780 ffffb11c077e7b28 ffffffffa8092196 ffffffffa7ca4a37
[Fri Jul 21 10:53:23 2017] ffff9eb468940bc0 7fffffffffffffff 0000000000000000 ffff9eb468940bc0
[Fri Jul 21 10:53:23 2017] Call Trace:
[Fri Jul 21 10:53:23 2017] [<ffffffffa8092196>] ? __schedule+0x166/0x530
[Fri Jul 21 10:53:23 2017] [<ffffffffa7ca4a37>] ? __enqueue_entity+0x67/0x70
[Fri Jul 21 10:53:23 2017] [<ffffffffa8092591>] schedule+0x31/0x80
[Fri Jul 21 10:53:23 2017] [<ffffffffa8094ed7>] schedule_timeout+0x147/0x190
[Fri Jul 21 10:53:23 2017] [<ffffffffa7ca3faf>] ? sched_clock_cpu+0x8f/0xa0
[Fri Jul 21 10:53:23 2017] [<ffffffffa7c9f06f>] ? check_preempt_curr+0x4f/0x80
[Fri Jul 21 10:53:23 2017] [<ffffffffa8092f10>] wait_for_common+0xb0/0x170
[Fri Jul 21 10:53:23 2017] [<ffffffffa7c9fdb0>] ? wake_up_q+0x70/0x70
[Fri Jul 21 10:53:23 2017] [<ffffffffa8092fe8>] wait_for_completion+0x18/0x20
[Fri Jul 21 10:53:23 2017] [<ffffffffc02b3901>] target_wait_for_sess_cmds+0x41/0x110 [target_core_mod]
[Fri Jul 21 10:53:23 2017] [<ffffffffc033afb0>] tcm_qla2xxx_free_session+0x40/0x80 [tcm_qla2xxx]
[Fri Jul 21 10:53:23 2017] [<ffffffffc048d44d>] qlt_free_session_done+0x2cd/0x3d0 [qla2xxx]
[Fri Jul 21 10:53:23 2017] [<ffffffffa7c92323>] ? insert_work+0x53/0xc0
[Fri Jul 21 10:53:23 2017] [<ffffffffa7c93111>] process_one_work+0x121/0x3c0
[Fri Jul 21 10:53:23 2017] [<ffffffffa7c933f3>] worker_thread+0x43/0x4d0
[Fri Jul 21 10:53:23 2017] [<ffffffffa7c933b0>] ? process_one_work+0x3c0/0x3c0
[Fri Jul 21 10:53:23 2017] [<ffffffffa7c933b0>] ? process_one_work+0x3c0/0x3c0
[Fri Jul 21 10:53:23 2017] [<ffffffffa7c98be4>] kthread+0xd4/0xf0
[Fri Jul 21 10:53:23 2017] [<ffffffffa7c98b10>] ? kthread_worker_fn+0x100/0x100
[Fri Jul 21 10:53:23 2017] [<ffffffffa8096022>] ret_from_fork+0x22/0x30
[Fri Jul 21 10:53:23 2017] INFO: task kworker/6:2:788 blocked for more than 180 seconds.
[Fri Jul 21 10:53:23 2017] Not tainted 4.9.38-ul5 #3
[Fri Jul 21 10:53:23 2017] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Fri Jul 21 10:53:23 2017] kworker/6:2 D 0 788 2 0x00000000
[Fri Jul 21 10:53:23 2017] Workqueue: events qlt_free_session_done [qla2xxx]
[Fri Jul 21 10:53:23 2017] 0000000000000000 ffff9ebcd4c38000 ffff9ebcc9ce8000 ffff9ebcffc95080
[Fri Jul 21 10:53:23 2017] ffff9ebcdc5c9780 ffffb11c24e67b28 ffffffffa8092196 ffffffffa7ca4a37
[Fri Jul 21 10:53:23 2017] ffff9ebcc9ce8000 7fffffffffffffff 0000000000000000 ffff9ebcc9ce8000
[Fri Jul 21 10:53:23 2017] Call Trace:
[Fri Jul 21 10:53:23 2017] [<ffffffffa8092196>] ? __schedule+0x166/0x530
[Fri Jul 21 10:53:23 2017] [<ffffffffa7ca4a37>] ? __enqueue_entity+0x67/0x70
[Fri Jul 21 10:53:23 2017] [<ffffffffa8092591>] schedule+0x31/0x80
[Fri Jul 21 10:53:23 2017] [<ffffffffa8094ed7>] schedule_timeout+0x147/0x190
[Fri Jul 21 10:53:23 2017] [<ffffffffa7ca3faf>] ? sched_clock_cpu+0x8f/0xa0
[Fri Jul 21 10:53:23 2017] [<ffffffffa7c9f06f>] ? check_preempt_curr+0x4f/0x80
[Fri Jul 21 10:53:23 2017] [<ffffffffa8092f10>] wait_for_common+0xb0/0x170
[Fri Jul 21 10:53:23 2017] [<ffffffffa7c9fdb0>] ? wake_up_q+0x70/0x70
[Fri Jul 21 10:53:23 2017] [<ffffffffa8092fe8>] wait_for_completion+0x18/0x20
[Fri Jul 21 10:53:23 2017] [<ffffffffc02b3901>] target_wait_for_sess_cmds+0x41/0x110 [target_core_mod]
[Fri Jul 21 10:53:23 2017] [<ffffffffc033afb0>] tcm_qla2xxx_free_session+0x40/0x80 [tcm_qla2xxx]
[Fri Jul 21 10:53:23 2017] [<ffffffffc048d44d>] qlt_free_session_done+0x2cd/0x3d0 [qla2xxx]
[Fri Jul 21 10:53:23 2017] [<ffffffffa7c92323>] ? insert_work+0x53/0xc0
[Fri Jul 21 10:53:23 2017] [<ffffffffa7c93111>] process_one_work+0x121/0x3c0
[Fri Jul 21 10:53:23 2017] [<ffffffffa7c933f3>] worker_thread+0x43/0x4d0
[Fri Jul 21 10:53:23 2017] [<ffffffffa7c933b0>] ? process_one_work+0x3c0/0x3c0
[Fri Jul 21 10:53:23 2017] [<ffffffffa7c933b0>] ? process_one_work+0x3c0/0x3c0
[Fri Jul 21 10:53:23 2017] [<ffffffffa7c98be4>] kthread+0xd4/0xf0
[Fri Jul 21 10:53:23 2017] [<ffffffffa7c98b10>] ? kthread_worker_fn+0x100/0x100
[Fri Jul 21 10:53:23 2017] [<ffffffffa8096022>] ret_from_fork+0x22/0x30
[Fri Jul 21 10:53:23 2017] INFO: task kworker/u33:0:3806 blocked for more than 180 seconds.
[Fri Jul 21 10:53:23 2017] Not tainted 4.9.38-ul5 #3
[Fri Jul 21 10:53:23 2017] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Fri Jul 21 10:53:23 2017] kworker/u33:0 D 0 3806 2 0x00000000
[Fri Jul 21 10:53:23 2017] Workqueue: tmr-iblock target_tmr_work [target_core_mod]
[Fri Jul 21 10:53:23 2017] 0000000000000000 ffff9eb4d5331880 ffff9eb468909780 ffff9eb4dfd15080
[Fri Jul 21 10:53:23 2017] ffff9eb4dc68e9c0 ffffb11c26d47be8 ffffffffa8092196 000000000011f104
[Fri Jul 21 10:53:23 2017] ffff9eb468909780 7fffffffffffffff 0000000000000000 ffff9eb468909780
[Fri Jul 21 10:53:23 2017] Call Trace:
[Fri Jul 21 10:53:23 2017] [<ffffffffa8092196>] ? __schedule+0x166/0x530
[Fri Jul 21 10:53:23 2017] [<ffffffffa8092591>] schedule+0x31/0x80
[Fri Jul 21 10:53:23 2017] [<ffffffffa8094ed7>] schedule_timeout+0x147/0x190
[Fri Jul 21 10:53:23 2017] [<ffffffffa7cb5edd>] ? up+0x2d/0x50
[Fri Jul 21 10:53:23 2017] [<ffffffffa7ce7b85>] ? irq_work_queue+0x85/0x90
[Fri Jul 21 10:53:23 2017] [<ffffffffa8092f10>] wait_for_common+0xb0/0x170
[Fri Jul 21 10:53:23 2017] [<ffffffffa7c9fdb0>] ? wake_up_q+0x70/0x70
[Fri Jul 21 10:53:23 2017] [<ffffffffa8092fe8>] wait_for_completion+0x18/0x20
[Fri Jul 21 10:53:23 2017] [<ffffffffc02b0d9c>] __transport_wait_for_tasks+0xac/0x110 [target_core_mod]
[Fri Jul 21 10:53:23 2017] [<ffffffffc02b0ece>] transport_wait_for_tasks+0x4e/0x80 [target_core_mod]
[Fri Jul 21 10:53:23 2017] [<ffffffffc02aea79>] core_tmr_abort_task+0x119/0x190 [target_core_mod]
[Fri Jul 21 10:53:23 2017] [<ffffffffc02b1011>] target_tmr_work+0x111/0x120 [target_core_mod]
[Fri Jul 21 10:53:23 2017] [<ffffffffa7c93111>] process_one_work+0x121/0x3c0
[Fri Jul 21 10:53:23 2017] [<ffffffffa7c933f3>] worker_thread+0x43/0x4d0
[Fri Jul 21 10:53:23 2017] [<ffffffffa7c933b0>] ? process_one_work+0x3c0/0x3c0
[Fri Jul 21 10:53:23 2017] [<ffffffffa7c933b0>] ? process_one_work+0x3c0/0x3c0
[Fri Jul 21 10:53:23 2017] [<ffffffffa7c98be4>] kthread+0xd4/0xf0
[Fri Jul 21 10:53:23 2017] [<ffffffffa7c98b10>] ? kthread_worker_fn+0x100/0x100
[Fri Jul 21 10:53:23 2017] [<ffffffffa8096022>] ret_from_fork+0x22/0x30
[Fri Jul 21 10:53:23 2017] INFO: task kworker/u34:2:4840 blocked for more than 180 seconds.
[Fri Jul 21 10:53:23 2017] Not tainted 4.9.38-ul5 #3
[Fri Jul 21 10:53:23 2017] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Fri Jul 21 10:53:23 2017] kworker/u34:2 D 0 4840 2 0x00000000
[Fri Jul 21 10:53:23 2017] Workqueue: tmr-iblock target_tmr_work [target_core_mod]
[Fri Jul 21 10:53:23 2017] 0000000000000000 ffff9eb4d5331880 ffff9ebccaccd240 ffff9ebcffcd5080
[Fri Jul 21 10:53:23 2017] ffff9ebcdc5ca340 ffffb11c07d0fbe8 ffffffffa8092196 0000000000130b94
[Fri Jul 21 10:53:23 2017] ffff9ebccaccd240 7fffffffffffffff 0000000000000000 ffff9ebccaccd240
[Fri Jul 21 10:53:23 2017] Call Trace:
[Fri Jul 21 10:53:23 2017] [<ffffffffa8092196>] ? __schedule+0x166/0x530
[Fri Jul 21 10:53:23 2017] [<ffffffffa8092591>] schedule+0x31/0x80
[Fri Jul 21 10:53:23 2017] [<ffffffffa8094ed7>] schedule_timeout+0x147/0x190
[Fri Jul 21 10:53:23 2017] [<ffffffffa7cb5edd>] ? up+0x2d/0x50
[Fri Jul 21 10:53:23 2017] [<ffffffffa7ce7b85>] ? irq_work_queue+0x85/0x90
[Fri Jul 21 10:53:23 2017] [<ffffffffa8092f10>] wait_for_common+0xb0/0x170
[Fri Jul 21 10:53:23 2017] [<ffffffffa7c9fdb0>] ? wake_up_q+0x70/0x70
[Fri Jul 21 10:53:23 2017] [<ffffffffa8092fe8>] wait_for_completion+0x18/0x20
[Fri Jul 21 10:53:23 2017] [<ffffffffc02b0d9c>] __transport_wait_for_tasks+0xac/0x110 [target_core_mod]
[Fri Jul 21 10:53:23 2017] [<ffffffffc02b0ece>] transport_wait_for_tasks+0x4e/0x80 [target_core_mod]
[Fri Jul 21 10:53:23 2017] [<ffffffffc02aea79>] core_tmr_abort_task+0x119/0x190 [target_core_mod]
[Fri Jul 21 10:53:23 2017] [<ffffffffc02b1011>] target_tmr_work+0x111/0x120 [target_core_mod]
[Fri Jul 21 10:53:23 2017] [<ffffffffa7c93111>] process_one_work+0x121/0x3c0
[Fri Jul 21 10:53:23 2017] [<ffffffffa7c933f3>] worker_thread+0x43/0x4d0
[Fri Jul 21 10:53:23 2017] [<ffffffffa7c933b0>] ? process_one_work+0x3c0/0x3c0
[Fri Jul 21 10:53:23 2017] [<ffffffffa7c933b0>] ? process_one_work+0x3c0/0x3c0
[Fri Jul 21 10:53:23 2017] [<ffffffffa7c98be4>] kthread+0xd4/0xf0
[Fri Jul 21 10:53:23 2017] [<ffffffffa7c98b10>] ? kthread_worker_fn+0x100/0x100
[Fri Jul 21 10:53:23 2017] [<ffffffffa8096022>] ret_from_fork+0x22/0x30
[Fri Jul 21 10:56:23 2017] INFO: task kworker/13:1:3063 blocked for more than 180 seconds.
[Fri Jul 21 10:56:23 2017] Not tainted 4.9.38-ul5 #3
[Fri Jul 21 10:56:23 2017] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Fri Jul 21 10:56:23 2017] kworker/13:1 D 0 3063 2 0x00000000
[Fri Jul 21 10:56:23 2017] Workqueue: events qlt_free_session_done [qla2xxx]
[Fri Jul 21 10:56:23 2017] 0000000000000000 ffff9ebcd4c38000 ffff9ebccd9c2340 ffff9ebcffd55080
[Fri Jul 21 10:56:23 2017] ffff9ebcdc5cde00 ffffb11c2736fb28 ffffffffa8092196 ffffffffa7ca4a37
[Fri Jul 21 10:56:23 2017] ffff9ebccd9c2340 7fffffffffffffff 0000000000000000 ffff9ebccd9c2340
[Fri Jul 21 10:56:23 2017] Call Trace:
[Fri Jul 21 10:56:23 2017] [<ffffffffa8092196>] ? __schedule+0x166/0x530
[Fri Jul 21 10:56:23 2017] [<ffffffffa7ca4a37>] ? __enqueue_entity+0x67/0x70
[Fri Jul 21 10:56:23 2017] [<ffffffffa8092591>] schedule+0x31/0x80
[Fri Jul 21 10:56:23 2017] [<ffffffffa8094ed7>] schedule_timeout+0x147/0x190
[Fri Jul 21 10:56:23 2017] [<ffffffffa7ca3faf>] ? sched_clock_cpu+0x8f/0xa0
[Fri Jul 21 10:56:23 2017] [<ffffffffa7c9f06f>] ? check_preempt_curr+0x4f/0x80
[Fri Jul 21 10:56:23 2017] [<ffffffffa8092f10>] wait_for_common+0xb0/0x170
[Fri Jul 21 10:56:23 2017] [<ffffffffa7c9fdb0>] ? wake_up_q+0x70/0x70
[Fri Jul 21 10:56:23 2017] [<ffffffffa8092fe8>] wait_for_completion+0x18/0x20
[Fri Jul 21 10:56:23 2017] [<ffffffffc02b3901>] target_wait_for_sess_cmds+0x41/0x110 [target_core_mod]
[Fri Jul 21 10:56:23 2017] [<ffffffffc033afb0>] tcm_qla2xxx_free_session+0x40/0x80 [tcm_qla2xxx]
[Fri Jul 21 10:56:23 2017] [<ffffffffc048d44d>] qlt_free_session_done+0x2cd/0x3d0 [qla2xxx]
[Fri Jul 21 10:56:23 2017] [<ffffffffa7c92323>] ? insert_work+0x53/0xc0
[Fri Jul 21 10:56:23 2017] [<ffffffffa7c93111>] process_one_work+0x121/0x3c0
[Fri Jul 21 10:56:23 2017] [<ffffffffa7c933f3>] worker_thread+0x43/0x4d0
[Fri Jul 21 10:56:23 2017] [<ffffffffa7c933b0>] ? process_one_work+0x3c0/0x3c0
[Fri Jul 21 10:56:23 2017] [<ffffffffa7c933b0>] ? process_one_work+0x3c0/0x3c0
[Fri Jul 21 10:56:23 2017] [<ffffffffa7c98be4>] kthread+0xd4/0xf0
[Fri Jul 21 10:56:23 2017] [<ffffffffa7c98b10>] ? kthread_worker_fn+0x100/0x100
[Fri Jul 21 10:56:23 2017] [<ffffffffa8096022>] ret_from_fork+0x22/0x30
[Fri Jul 21 10:56:23 2017] INFO: task kworker/2:1:20395 blocked for more than 180 seconds.
[Fri Jul 21 10:56:23 2017] Not tainted 4.9.38-ul5 #3
[Fri Jul 21 10:56:23 2017] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Fri Jul 21 10:56:23 2017] kworker/2:1 D 0 20395 2 0x00000000
[Fri Jul 21 10:56:23 2017] Workqueue: events qlt_free_session_done [qla2xxx]
[Fri Jul 21 10:56:23 2017] 0000000000000000 ffff9ebcd4c38000 ffff9eb468940bc0 ffff9eb4dfc95080
[Fri Jul 21 10:56:23 2017] ffff9eb4dc689780 ffffb11c077e7b28 ffffffffa8092196 ffffffffa7ca4a37
[Fri Jul 21 10:56:23 2017] ffff9eb468940bc0 7fffffffffffffff 0000000000000000 ffff9eb468940bc0
[Fri Jul 21 10:56:23 2017] Call Trace:
[Fri Jul 21 10:56:23 2017] [<ffffffffa8092196>] ? __schedule+0x166/0x530
[Fri Jul 21 10:56:23 2017] [<ffffffffa7ca4a37>] ? __enqueue_entity+0x67/0x70
[Fri Jul 21 10:56:23 2017] [<ffffffffa8092591>] schedule+0x31/0x80
[Fri Jul 21 10:56:23 2017] [<ffffffffa8094ed7>] schedule_timeout+0x147/0x190
[Fri Jul 21 10:56:23 2017] [<ffffffffa7ca3faf>] ? sched_clock_cpu+0x8f/0xa0
[Fri Jul 21 10:56:23 2017] [<ffffffffa7c9f06f>] ? check_preempt_curr+0x4f/0x80
[Fri Jul 21 10:56:23 2017] [<ffffffffa8092f10>] wait_for_common+0xb0/0x170
[Fri Jul 21 10:56:23 2017] [<ffffffffa7c9fdb0>] ? wake_up_q+0x70/0x70
[Fri Jul 21 10:56:23 2017] [<ffffffffa8092fe8>] wait_for_completion+0x18/0x20
[Fri Jul 21 10:56:23 2017] [<ffffffffc02b3901>] target_wait_for_sess_cmds+0x41/0x110 [target_core_mod]
[Fri Jul 21 10:56:23 2017] [<ffffffffc033afb0>] tcm_qla2xxx_free_session+0x40/0x80 [tcm_qla2xxx]
[Fri Jul 21 10:56:23 2017] [<ffffffffc048d44d>] qlt_free_session_done+0x2cd/0x3d0 [qla2xxx]
[Fri Jul 21 10:56:23 2017] [<ffffffffa7c92323>] ? insert_work+0x53/0xc0
[Fri Jul 21 10:56:23 2017] [<ffffffffa7c93111>] process_one_work+0x121/0x3c0
[Fri Jul 21 10:56:23 2017] [<ffffffffa7c933f3>] worker_thread+0x43/0x4d0
[Fri Jul 21 10:56:23 2017] [<ffffffffa7c933b0>] ? process_one_work+0x3c0/0x3c0
[Fri Jul 21 10:56:23 2017] [<ffffffffa7c933b0>] ? process_one_work+0x3c0/0x3c0
[Fri Jul 21 10:56:23 2017] [<ffffffffa7c98be4>] kthread+0xd4/0xf0
[Fri Jul 21 10:56:23 2017] [<ffffffffa7c98b10>] ? kthread_worker_fn+0x100/0x100
[Fri Jul 21 10:56:23 2017] [<ffffffffa8096022>] ret_from_fork+0x22/0x30
[Fri Jul 21 10:56:23 2017] INFO: task kworker/10:2:28233 blocked for more than 180 seconds.
[Fri Jul 21 10:56:23 2017] Not tainted 4.9.38-ul5 #3
[Fri Jul 21 10:56:23 2017] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Fri Jul 21 10:56:23 2017] kworker/10:2 D 0 28233 2 0x00000000
[Fri Jul 21 10:56:23 2017] Workqueue: events qlt_free_session_done [qla2xxx]
[Fri Jul 21 10:56:23 2017] 0000000000000000 ffff9ebcd4c38000 ffff9eb467848000 ffff9eb4dfd95080
[Fri Jul 21 10:56:23 2017] ffff9eb4dc68d240 ffffb11c0ffcbb28 ffffffffa8092196 ffffffffa7ca4a37
[Fri Jul 21 10:56:23 2017] ffff9eb467848000 7fffffffffffffff 0000000000000000 ffff9eb467848000
[Fri Jul 21 10:56:23 2017] Call Trace:
[Fri Jul 21 10:56:23 2017] [<ffffffffa8092196>] ? __schedule+0x166/0x530
[Fri Jul 21 10:56:23 2017] [<ffffffffa7ca4a37>] ? __enqueue_entity+0x67/0x70
[Fri Jul 21 10:56:23 2017] [<ffffffffa8092591>] schedule+0x31/0x80
[Fri Jul 21 10:56:23 2017] [<ffffffffa8094ed7>] schedule_timeout+0x147/0x190
[Fri Jul 21 10:56:23 2017] [<ffffffffa7ca3faf>] ? sched_clock_cpu+0x8f/0xa0
[Fri Jul 21 10:56:23 2017] [<ffffffffa7c9f06f>] ? check_preempt_curr+0x4f/0x80
[Fri Jul 21 10:56:23 2017] [<ffffffffa8092f10>] wait_for_common+0xb0/0x170
[Fri Jul 21 10:56:23 2017] [<ffffffffa7c9fdb0>] ? wake_up_q+0x70/0x70
[Fri Jul 21 10:56:23 2017] [<ffffffffa8092fe8>] wait_for_completion+0x18/0x20
[Fri Jul 21 10:56:23 2017] [<ffffffffc02b3901>] target_wait_for_sess_cmds+0x41/0x110 [target_core_mod]
[Fri Jul 21 10:56:23 2017] [<ffffffffc033afb0>] tcm_qla2xxx_free_session+0x40/0x80 [tcm_qla2xxx]
[Fri Jul 21 10:56:23 2017] [<ffffffffc048d44d>] qlt_free_session_done+0x2cd/0x3d0 [qla2xxx]
[Fri Jul 21 10:56:23 2017] [<ffffffffa7c92323>] ? insert_work+0x53/0xc0
[Fri Jul 21 10:56:23 2017] [<ffffffffa7c93111>] process_one_work+0x121/0x3c0
[Fri Jul 21 10:56:23 2017] [<ffffffffa7c933f3>] worker_thread+0x43/0x4d0
[Fri Jul 21 10:56:23 2017] [<ffffffffa7c933b0>] ? process_one_work+0x3c0/0x3c0
[Fri Jul 21 10:56:23 2017] [<ffffffffa7c933b0>] ? process_one_work+0x3c0/0x3c0
[Fri Jul 21 10:56:23 2017] [<ffffffffa7c98be4>] kthread+0xd4/0xf0
[Fri Jul 21 10:56:23 2017] [<ffffffffa7c98b10>] ? kthread_worker_fn+0x100/0x100
[Fri Jul 21 10:56:23 2017] [<ffffffffa8096022>] ret_from_fork+0x22/0x30
[Fri Jul 21 10:56:23 2017] INFO: task kworker/6:2:788 blocked for more than 180 seconds.
[Fri Jul 21 10:56:23 2017] Not tainted 4.9.38-ul5 #3
[Fri Jul 21 10:56:23 2017] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Fri Jul 21 10:56:23 2017] kworker/6:2 D 0 788 2 0x00000000
[Fri Jul 21 10:56:23 2017] Workqueue: events qlt_free_session_done [qla2xxx]
[Fri Jul 21 10:56:23 2017] 0000000000000000 ffff9ebcd4c38000 ffff9ebcc9ce8000 ffff9ebcffc95080
[Fri Jul 21 10:56:23 2017] ffff9ebcdc5c9780 ffffb11c24e67b28 ffffffffa8092196 ffffffffa7ca4a37
[Fri Jul 21 10:56:23 2017] ffff9ebcc9ce8000 7fffffffffffffff 0000000000000000 ffff9ebcc9ce8000
[Fri Jul 21 10:56:23 2017] Call Trace:
[Fri Jul 21 10:56:23 2017] [<ffffffffa8092196>] ? __schedule+0x166/0x530
[Fri Jul 21 10:56:23 2017] [<ffffffffa7ca4a37>] ? __enqueue_entity+0x67/0x70
[Fri Jul 21 10:56:23 2017] [<ffffffffa8092591>] schedule+0x31/0x80
[Fri Jul 21 10:56:23 2017] [<ffffffffa8094ed7>] schedule_timeout+0x147/0x190
[Fri Jul 21 10:56:23 2017] [<ffffffffa7ca3faf>] ? sched_clock_cpu+0x8f/0xa0
[Fri Jul 21 10:56:23 2017] [<ffffffffa7c9f06f>] ? check_preempt_curr+0x4f/0x80
[Fri Jul 21 10:56:23 2017] [<ffffffffa8092f10>] wait_for_common+0xb0/0x170
[Fri Jul 21 10:56:23 2017] [<ffffffffa7c9fdb0>] ? wake_up_q+0x70/0x70
[Fri Jul 21 10:56:23 2017] [<ffffffffa8092fe8>] wait_for_completion+0x18/0x20
[Fri Jul 21 10:56:23 2017] [<ffffffffc02b3901>] target_wait_for_sess_cmds+0x41/0x110 [target_core_mod]
[Fri Jul 21 10:56:23 2017] [<ffffffffc033afb0>] tcm_qla2xxx_free_session+0x40/0x80 [tcm_qla2xxx]
[Fri Jul 21 10:56:23 2017] [<ffffffffc048d44d>] qlt_free_session_done+0x2cd/0x3d0 [qla2xxx]
[Fri Jul 21 10:56:23 2017] [<ffffffffa7c92323>] ? insert_work+0x53/0xc0
[Fri Jul 21 10:56:23 2017] [<ffffffffa7c93111>] process_one_work+0x121/0x3c0
[Fri Jul 21 10:56:23 2017] [<ffffffffa7c933f3>] worker_thread+0x43/0x4d0
[Fri Jul 21 10:56:23 2017] [<ffffffffa7c933b0>] ? process_one_work+0x3c0/0x3c0
[Fri Jul 21 10:56:23 2017] [<ffffffffa7c933b0>] ? process_one_work+0x3c0/0x3c0
[Fri Jul 21 10:56:23 2017] [<ffffffffa7c98be4>] kthread+0xd4/0xf0
[Fri Jul 21 10:56:23 2017] [<ffffffffa7c98b10>] ? kthread_worker_fn+0x100/0x100
[Fri Jul 21 10:56:23 2017] [<ffffffffa8096022>] ret_from_fork+0x22/0x30

I hope this helps, let me know if you need anything from our end.

Regards,
Pascal de Bruijn
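One way to read the hung task warnings above (the call chains are copied
straight from the stacks; the grouping is ours): the blocked kworkers fall
into two groups.

  tmr-iblock workers:
    target_tmr_work()
      -> core_tmr_abort_task()
        -> transport_wait_for_tasks()
          -> wait_for_completion()   waits for the aborted command to stop

  events workers:
    qlt_free_session_done()
      -> tcm_qla2xxx_free_session()
        -> target_wait_for_sess_cmds()
          -> wait_for_completion()   waits for every command on the session

If a single aborted command never completes back, it can pin both the
ABORT_TASK TMR and every subsequent session teardown behind it, which would
match VMware seeing the volume as inaccessible while the machine itself
stays responsive.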
On 21-07-17 15:07, Pascal de Bruijn wrote:
> On 11-07-17 19:52, Nicholas A. Bellinger wrote:
>> Hi Pascal,
>>
>> ... SNIP ...
>
> We still haven't been able to reliably reproduce, however we did come
> across this, which is very likely related.
>
> ... SNIP (dmesg and hung task traces quoted in full above) ...
>
> I hope this helps, let me know if you need anything from our end.

I forgot to clarify, so this was with 4.9.38, with the following patches
applied:

1eb42f965cedafb700e9c902ddafb1c51e3117f7.patch
4db6a8145940d0bbd10265020d681961ce2d3238.patch
linux-4.9.38-target-filter-0x85-scsi-opcde-warning.patch (see below)
nab-debug.patch (the patch you asked me to apply)

Regards,
Pascal de Bruijn

diff -Nurpd a/drivers/target/target_core_transport.c b/drivers/target/target_core_transport.c
--- a/drivers/target/target_core_transport.c	2017-07-15 12:17:55.000000000 +0200
+++ b/drivers/target/target_core_transport.c	2017-07-20 09:51:21.494390926 +0200
@@ -1337,7 +1337,7 @@ target_setup_cmd_from_cdb(struct se_cmd
 	trace_target_sequencer_start(cmd);
 
 	ret = dev->transport->parse_cdb(cmd);
-	if (ret == TCM_UNSUPPORTED_SCSI_OPCODE)
+	if (ret == TCM_UNSUPPORTED_SCSI_OPCODE && cmd->t_task_cdb[0] != 0x85) /* filter VMware ATA PASS THROUGH for SMART */
 		pr_warn_ratelimited("%s/%s: Unsupported SCSI Opcode 0x%02x, sending CHECK_CONDITION.\n",
 			cmd->se_tfo->get_fabric_name(),
 			cmd->se_sess->se_node_acl->initiatorname,
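A note on the filter patch above: opcode 0x85 is ATA PASS-THROUGH(16),
which VMware ESXi issues when polling SMART, so the one-liner merely
silences the ratelimited warning for that opcode. If more noisy opcodes
ever turned up, the same idea could be made table-driven; a minimal sketch
(hypothetical helper, not part of the patch or the kernel):

static const unsigned char quiet_opcodes[] = { 0x85 };	/* ATA PASS-THROUGH(16) */

/* Return true if the "Unsupported SCSI Opcode" warning should be
 * suppressed for this opcode. */
static bool opcode_warning_suppressed(unsigned char opcode)
{
	size_t i;

	for (i = 0; i < ARRAY_SIZE(quiet_opcodes); i++)
		if (quiet_opcodes[i] == opcode)
			return true;
	return false;
}

For reference, the nab-debug.patch mentioned above appears to be the debug
diff reproduced below (it matches the second diff quoted earlier in the
thread).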
diff --git a/drivers/scsi/qla2xxx/tcm_qla2xxx.c b/drivers/scsi/qla2xxx/tcm_qla2xxx.c
index 8c1bf9b..01ffb8a 100644
--- a/drivers/scsi/qla2xxx/tcm_qla2xxx.c
+++ b/drivers/scsi/qla2xxx/tcm_qla2xxx.c
@@ -417,6 +417,8 @@ static int tcm_qla2xxx_write_pending_status(struct se_cmd *se_cmd)
 {
 	struct qla_tgt_cmd *cmd = container_of(se_cmd, struct qla_tgt_cmd, se_cmd);
 	unsigned long flags;
+	u64 tag = se_cmd->tag;
+
 	/*
 	 * Check for WRITE_PENDING status to determine if we need to wait for
 	 * CTIO aborts to be posted via hardware in tcm_qla2xxx_handle_data().
@@ -425,7 +427,9 @@ static int tcm_qla2xxx_write_pending_status(struct se_cmd *se_cmd)
 	if (se_cmd->t_state == TRANSPORT_WRITE_PENDING ||
 	    se_cmd->t_state == TRANSPORT_COMPLETE_QF_WP) {
 		spin_unlock_irqrestore(&se_cmd->t_state_lock, flags);
+		printk("Blocked on write_pending_abort_comp for tag: %llu\n", tag);
 		wait_for_completion(&cmd->write_pending_abort_comp);
+		printk("Awoke write_pending_abort_comp for tag: %llu\n", tag);
 		return 0;
 	}
 	spin_unlock_irqrestore(&se_cmd->t_state_lock, flags);
@@ -508,6 +512,11 @@ static void tcm_qla2xxx_handle_data_work(struct work_struct *work)
 	 */
 	cmd->cmd_in_wq = 0;
 
+	if (cmd->se_cmd.transport_state & CMD_T_ABORTED) {
+		printk("Detected CMD_T_ABORTED from handle_data_work tag: %llu %u\n",
+			cmd->se_cmd.tag, cmd->write_data_transferred);
+	}
+
 	cmd->vha->tgt_counters.qla_core_ret_ctio++;
 	if (!cmd->write_data_transferred) {
 		/*
@@ -515,6 +524,8 @@ static void tcm_qla2xxx_handle_data_work(struct work_struct *work)
 	 * waiting upon completion in tcm_qla2xxx_write_pending_status()
 	 */
 	if (cmd->se_cmd.transport_state & CMD_T_ABORTED) {
+		printk("Completing write_pending_abort_comp for tag: %llu\n",
+			cmd->se_cmd.tag);
 		complete(&cmd->write_pending_abort_comp);
 		return;
 	}
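While chasing a lost completion like this one, it can also help to bound
the wait so a missing complete() produces periodic log noise instead of a
silent, permanent D-state kworker. A minimal sketch using
wait_for_completion_timeout() (illustrative only; the 30-second period, the
helper name, and the message are made up, and the in-tree code uses the
unbounded wait):

#include <linux/completion.h>
#include <linux/jiffies.h>
#include <linux/printk.h>

/* Wait like wait_for_completion(), but complain every 30 seconds while
 * the matching complete() has not yet arrived. */
static void wait_for_completion_noisy(struct completion *comp,
				      const char *what, u64 tag)
{
	while (!wait_for_completion_timeout(comp, msecs_to_jiffies(30 * 1000)))
		printk("still waiting on %s for tag: %llu\n", what, tag);
}

e.g. wait_for_completion_noisy(&cmd->write_pending_abort_comp,
"write_pending_abort_comp", cmd->se_cmd.tag); in place of the unbounded
wait, purely while debugging.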