diff mbox

[RFC] Clear out stuck ops to prevent iSER from going init D state

Message ID CAANLjFpi2E-2UxS3YSgjN=GzyZdLBhqJa4LN++52HTw6_3DKJA@mail.gmail.com (mailing list archive)
State Not Applicable
Headers show

Commit Message

Robert LeBlanc Jan. 23, 2017, 7:01 p.m. UTC
In certain circumstances the RDMA connection can be abruptly
terminated, but something is getting stuck preventing the iSCSI clean
up commands from being completed. Just removing the isert_wait4*
commands isn't enough. Just resetting the queue pair isn't enough
either. This patch allows the session to be renegotiated and the iSCSI
process never goes into D state. I usually get iSCSI session errors
because they are not being cleaned up properly (obviously). I need
some help getting this patch fixed right as resetting the queue pair
is probably not the right approach and overkill to solving the
problem. I think it at least shows where the problem is occurring and
how I can get around it.

The problem easily shows up with two ConnectX-4-LX card connected to a
10 Gb switch. The target is a RAM disk and the initiator just mounts
it as ext4 and runs fio. During the lay down of the files, the
connection disruption causes the indefinite D state usually within the
first 4 GB. We have also experienced a very similar backtrace of the D
state processes on our Infiniband hardware following abrupt connection
losses (power loss to target) and a reinstatement of sessions where
the session information is not the same (we did not use targetcli to
save/restore exports, instead using a script to export causing an out
of order problem). We are now using targetcli to save/restore now and
the D state problem doesn't occur nearly as often, but we are
concerned that something like this could put the target in D state
requiring a reboot. Since we want to move to RoCE and the problem is
much easier to trigger there, we really need a fix.

I hope someone can provide some direction in this regard.

Here is a sample of the iSCSI errors with this patch.
----

[ 292.444044] ------------[ cut here ]------------
[ 292.444045] WARNING: CPU: 26 PID: 12705 at lib/list_debug.c:59
__list_del_entry+0xa1/0xd0
[ 292.444046] list_del corruption. prev->next should be
ffff8865628c27c0, but was dead000000000100
[ 292.444057] Modules linked in: ib_isert rdma_cm iw_cm ib_cm
target_core_user target_core_pscsi target_core_file target_core_iblock
mlx5_ib ib_core dm_mod 8021q garp mrp iptable_filter sb_edac edac_core
x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm ext4
ipmi_devintf irqbypass crct10dif_pclmul crc32_pclmul
ghash_clmulni_intel aesni_intel lrw jbd2 gf128mul mbcache mei_me
glue_helper iTCO_wdt ablk_helper cryptd iTCO_vendor_support mei joydev
sg ioatdma shpchp pcspkr i2c_i801 lpc_ich mfd_core i2c_smbus acpi_pad
wmi ipmi_si ipmi_msghandler acpi_power_meter ip_tables xfs libcrc32c
raid1 sd_mod ast drm_kms_helper syscopyarea sysfillrect sysimgblt
fb_sys_fops ttm mlx5_core igb ahci ptp drm libahci pps_core mlx4_core
libata dca i2c_algo_bit be2iscsi bnx2i cnic uio qla4xxx
iscsi_boot_sysfs
[ 292.444058] CPU: 26 PID: 12705 Comm: kworker/26:2 Tainted: G W 4.9.0+ #14
[ 292.444058] Hardware name: Supermicro SYS-6028TP-HTFR/X10DRT-PIBF,
BIOS 1.1 08/03/2015
[ 292.444059] Workqueue: target_completion target_complete_ok_work
[ 292.444060] ffffc90035533ca0 ffffffff8134d45f ffffc90035533cf0
0000000000000000
[ 292.444061] ffffc90035533ce0 ffffffff81083371 0000003b00000202
ffff8865628c27c0
[ 292.444062] ffff887f25f48064 0000000000000001 0000000000000000
0000000000000680
[ 292.444062] Call Trace:
[ 292.444063] [<ffffffff8134d45f>] dump_stack+0x63/0x84
[ 292.444065] [<ffffffff81083371>] __warn+0xd1/0xf0
[ 292.444066] [<ffffffff810833ef>] warn_slowpath_fmt+0x5f/0x80
[ 292.444067] [<ffffffff8136cce1>] __list_del_entry+0xa1/0xd0
[ 292.444067] [<ffffffff8136cd1d>] list_del+0xd/0x30
[ 292.444069] [<ffffffff8150a724>] target_remove_from_state_list+0x64/0x70
[ 292.444070] [<ffffffff8150a829>] transport_cmd_check_stop+0xf9/0x110
[ 292.444071] [<ffffffff8150e6c9>] target_complete_ok_work+0x169/0x360
[ 292.444072] [<ffffffff8109cc02>] process_one_work+0x152/0x400
[ 292.444072] [<ffffffff8109d4f5>] worker_thread+0x125/0x4b0
[ 292.444073] [<ffffffff8109d3d0>] ? rescuer_thread+0x380/0x380
[ 292.444075] [<ffffffff810a3059>] kthread+0xd9/0xf0
[ 292.444076] [<ffffffff810a2f80>] ? kthread_park+0x60/0x60
[ 292.444077] [<ffffffff817732d5>] ret_from_fork+0x25/0x30
[ 292.444078] ---[ end trace 721cfe26853c53b7 ]---


#endif /* IB_VERBS_H */

Thank you,
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Bart Van Assche Jan. 23, 2017, 7:10 p.m. UTC | #1
On Mon, 2017-01-23 at 12:01 -0700, Robert LeBlanc wrote:
> diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
> index 8368764..ed36748 100644
> --- a/drivers/infiniband/core/verbs.c
> +++ b/drivers/infiniband/core/verbs.c
> @@ -2089,3 +2089,19 @@ void ib_drain_qp(struct ib_qp *qp)
>                ib_drain_rq(qp);
> }
> EXPORT_SYMBOL(ib_drain_qp);
> +
> +void ib_reset_sq(struct ib_qp *qp)
> +{
> +       struct ib_qp_attr attr = { .qp_state = IB_QPS_RESET};
> +       int ret;
> +
> +       ret = ib_modify_qp(qp, &attr, IB_QP_STATE);
> +}
> +EXPORT_SYMBOL(ib_reset_sq);
> +
> +void ib_reset_qp(struct ib_qp *qp)
> +{
> +       printk("ib_reset_qp calling ib_reset_sq.\n");
> +       ib_reset_sq(qp);
> +}
> +EXPORT_SYMBOL(ib_reset_qp);

These are one liners. Is it really worth to add one-line functions to the
IB core?

> diff --git a/drivers/infiniband/ulp/isert/ib_isert.c
> b/drivers/infiniband/ulp/isert/ib_isert.c
> index 6dd43f6..619dbc7 100644
> --- a/drivers/infiniband/ulp/isert/ib_isert.c
> +++ b/drivers/infiniband/ulp/isert/ib_isert.c
> @@ -2595,10 +2595,9 @@ static void isert_wait_conn(struct iscsi_conn *conn)
>        isert_conn_terminate(isert_conn);
>        mutex_unlock(&isert_conn->mutex);
> 
> -       ib_drain_qp(isert_conn->qp);
> +       ib_reset_qp(isert_conn->qp);
>        isert_put_unsol_pending_cmds(conn);
> -       isert_wait4cmds(conn);
> -       isert_wait4logout(isert_conn);
> +       cancel_work_sync(&isert_conn->release_work);
> 
>        queue_work(isert_release_wq, &isert_conn->release_work);
> }

Sorry but leaving out the ib_drain_qp() and isert_wait*() calls seems wrong
to me. Additionally, resetting the send queue should not be needed since the
iSER target driver should guarantee that no new WRs will be queued on the
send queue after isert_wait_conn() is called.

Seeing this patch makes me wonder whether this behavior can be reproduced
with any other HBA than ConnectX-4 Lx? Is this a software or a firmware bug?

Thanks,

Bart.--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Robert LeBlanc Jan. 23, 2017, 7:24 p.m. UTC | #2
On Mon, Jan 23, 2017 at 12:10 PM, Bart Van Assche
<Bart.VanAssche@sandisk.com> wrote:
> On Mon, 2017-01-23 at 12:01 -0700, Robert LeBlanc wrote:
>> diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
>> index 8368764..ed36748 100644
>> --- a/drivers/infiniband/core/verbs.c
>> +++ b/drivers/infiniband/core/verbs.c
>> @@ -2089,3 +2089,19 @@ void ib_drain_qp(struct ib_qp *qp)
>>                ib_drain_rq(qp);
>> }
>> EXPORT_SYMBOL(ib_drain_qp);
>> +
>> +void ib_reset_sq(struct ib_qp *qp)
>> +{
>> +       struct ib_qp_attr attr = { .qp_state = IB_QPS_RESET};
>> +       int ret;
>> +
>> +       ret = ib_modify_qp(qp, &attr, IB_QP_STATE);
>> +}
>> +EXPORT_SYMBOL(ib_reset_sq);
>> +
>> +void ib_reset_qp(struct ib_qp *qp)
>> +{
>> +       printk("ib_reset_qp calling ib_reset_sq.\n");
>> +       ib_reset_sq(qp);
>> +}
>> +EXPORT_SYMBOL(ib_reset_qp);
>
> These are one liners. Is it really worth to add one-line functions to the
> IB core?
>
>> diff --git a/drivers/infiniband/ulp/isert/ib_isert.c
>> b/drivers/infiniband/ulp/isert/ib_isert.c
>> index 6dd43f6..619dbc7 100644
>> --- a/drivers/infiniband/ulp/isert/ib_isert.c
>> +++ b/drivers/infiniband/ulp/isert/ib_isert.c
>> @@ -2595,10 +2595,9 @@ static void isert_wait_conn(struct iscsi_conn *conn)
>>        isert_conn_terminate(isert_conn);
>>        mutex_unlock(&isert_conn->mutex);
>>
>> -       ib_drain_qp(isert_conn->qp);
>> +       ib_reset_qp(isert_conn->qp);
>>        isert_put_unsol_pending_cmds(conn);
>> -       isert_wait4cmds(conn);
>> -       isert_wait4logout(isert_conn);
>> +       cancel_work_sync(&isert_conn->release_work);
>>
>>        queue_work(isert_release_wq, &isert_conn->release_work);
>> }
>
> Sorry but leaving out the ib_drain_qp() and isert_wait*() calls seems wrong
> to me. Additionally, resetting the send queue should not be needed since the
> iSER target driver should guarantee that no new WRs will be queued on the
> send queue after isert_wait_conn() is called.
>
> Seeing this patch makes me wonder whether this behavior can be reproduced
> with any other HBA than ConnectX-4 Lx? Is this a software or a firmware bug?
>
> Thanks,
>
> Bart.

Yes, it all feels wrong which is why I need some guidance. The
backtrace of the Infiniband and RoCE target D state processes are
identical, I believe that there is an additional bug in the
ConnectX-4-LX firmware that causes the D state problem on the target
to be triggered much easier. The release notes seem to indicate that
it may firmware bug may be fixed in 14.17.2020, but there is not a
SuperMicro version yet and I can't match up the board IDs with enough
confidence yet to flash the Mellanox firmware to test.

In summary, I think there are two bugs. One in iSER causing the target
to go into D state when something funky happens with the connection on
both Infiniband and RoCE. And a second one in the ConnectX-4-LX
firmware which easily triggers the first more critical issue.

Is there someway to inspect what may be in the queue pair to see what
may be blocking things?

----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Laurence Oberman Jan. 23, 2017, 8:20 p.m. UTC | #3
----- Original Message -----
> From: "Bart Van Assche" <Bart.VanAssche@sandisk.com>
> To: robert@leblancnet.us, linux-rdma@vger.kernel.org
> Sent: Monday, January 23, 2017 2:10:27 PM
> Subject: Re: [RFC] Clear out stuck ops to prevent iSER from going init D state
> 
> On Mon, 2017-01-23 at 12:01 -0700, Robert LeBlanc wrote:
> > diff --git a/drivers/infiniband/core/verbs.c
> > b/drivers/infiniband/core/verbs.c
> > index 8368764..ed36748 100644
> > --- a/drivers/infiniband/core/verbs.c
> > +++ b/drivers/infiniband/core/verbs.c
> > @@ -2089,3 +2089,19 @@ void ib_drain_qp(struct ib_qp *qp)
> >                ib_drain_rq(qp);
> > }
> > EXPORT_SYMBOL(ib_drain_qp);
> > +
> > +void ib_reset_sq(struct ib_qp *qp)
> > +{
> > +       struct ib_qp_attr attr = { .qp_state = IB_QPS_RESET};
> > +       int ret;
> > +
> > +       ret = ib_modify_qp(qp, &attr, IB_QP_STATE);
> > +}
> > +EXPORT_SYMBOL(ib_reset_sq);
> > +
> > +void ib_reset_qp(struct ib_qp *qp)
> > +{
> > +       printk("ib_reset_qp calling ib_reset_sq.\n");
> > +       ib_reset_sq(qp);
> > +}
> > +EXPORT_SYMBOL(ib_reset_qp);
> 
> These are one liners. Is it really worth to add one-line functions to the
> IB core?
> 
> > diff --git a/drivers/infiniband/ulp/isert/ib_isert.c
> > b/drivers/infiniband/ulp/isert/ib_isert.c
> > index 6dd43f6..619dbc7 100644
> > --- a/drivers/infiniband/ulp/isert/ib_isert.c
> > +++ b/drivers/infiniband/ulp/isert/ib_isert.c
> > @@ -2595,10 +2595,9 @@ static void isert_wait_conn(struct iscsi_conn *conn)
> >        isert_conn_terminate(isert_conn);
> >        mutex_unlock(&isert_conn->mutex);
> > 
> > -       ib_drain_qp(isert_conn->qp);
> > +       ib_reset_qp(isert_conn->qp);
> >        isert_put_unsol_pending_cmds(conn);
> > -       isert_wait4cmds(conn);
> > -       isert_wait4logout(isert_conn);
> > +       cancel_work_sync(&isert_conn->release_work);
> > 
> >        queue_work(isert_release_wq, &isert_conn->release_work);
> > }
> 
> Sorry but leaving out the ib_drain_qp() and isert_wait*() calls seems wrong
> to me. Additionally, resetting the send queue should not be needed since the
> iSER target driver should guarantee that no new WRs will be queued on the
> send queue after isert_wait_conn() is called.
> 
> Seeing this patch makes me wonder whether this behavior can be reproduced
> with any other HBA than ConnectX-4 Lx? Is this a software or a firmware bug?
> 
> Thanks,
> 
> Bart.--
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

Hello Bart

Its on my plate to try reproduce, I was not able to reproduce with my mlx4 and IB as I was not in the office to pull cables and I am back-to-back.
I also need to try Ethernet.

I hope to see if I can reproduce What Robert is seeing later this week as I am fully back in the office tomorrow.

If I can, I will try make some sense of this.

Thanks
Laurence
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Laurence Oberman Jan. 23, 2017, 9:41 p.m. UTC | #4
----- Original Message -----
> From: "Laurence Oberman" <loberman@redhat.com>
> To: "Bart Van Assche" <Bart.VanAssche@sandisk.com>
> Cc: robert@leblancnet.us, linux-rdma@vger.kernel.org
> Sent: Monday, January 23, 2017 3:20:31 PM
> Subject: Re: [RFC] Clear out stuck ops to prevent iSER from going init D state
> 
> 
> 
> ----- Original Message -----
> > From: "Bart Van Assche" <Bart.VanAssche@sandisk.com>
> > To: robert@leblancnet.us, linux-rdma@vger.kernel.org
> > Sent: Monday, January 23, 2017 2:10:27 PM
> > Subject: Re: [RFC] Clear out stuck ops to prevent iSER from going init D
> > state
> > 
> > On Mon, 2017-01-23 at 12:01 -0700, Robert LeBlanc wrote:
> > > diff --git a/drivers/infiniband/core/verbs.c
> > > b/drivers/infiniband/core/verbs.c
> > > index 8368764..ed36748 100644
> > > --- a/drivers/infiniband/core/verbs.c
> > > +++ b/drivers/infiniband/core/verbs.c
> > > @@ -2089,3 +2089,19 @@ void ib_drain_qp(struct ib_qp *qp)
> > >                ib_drain_rq(qp);
> > > }
> > > EXPORT_SYMBOL(ib_drain_qp);
> > > +
> > > +void ib_reset_sq(struct ib_qp *qp)
> > > +{
> > > +       struct ib_qp_attr attr = { .qp_state = IB_QPS_RESET};
> > > +       int ret;
> > > +
> > > +       ret = ib_modify_qp(qp, &attr, IB_QP_STATE);
> > > +}
> > > +EXPORT_SYMBOL(ib_reset_sq);
> > > +
> > > +void ib_reset_qp(struct ib_qp *qp)
> > > +{
> > > +       printk("ib_reset_qp calling ib_reset_sq.\n");
> > > +       ib_reset_sq(qp);
> > > +}
> > > +EXPORT_SYMBOL(ib_reset_qp);
> > 
> > These are one liners. Is it really worth to add one-line functions to the
> > IB core?
> > 
> > > diff --git a/drivers/infiniband/ulp/isert/ib_isert.c
> > > b/drivers/infiniband/ulp/isert/ib_isert.c
> > > index 6dd43f6..619dbc7 100644
> > > --- a/drivers/infiniband/ulp/isert/ib_isert.c
> > > +++ b/drivers/infiniband/ulp/isert/ib_isert.c
> > > @@ -2595,10 +2595,9 @@ static void isert_wait_conn(struct iscsi_conn
> > > *conn)
> > >        isert_conn_terminate(isert_conn);
> > >        mutex_unlock(&isert_conn->mutex);
> > > 
> > > -       ib_drain_qp(isert_conn->qp);
> > > +       ib_reset_qp(isert_conn->qp);
> > >        isert_put_unsol_pending_cmds(conn);
> > > -       isert_wait4cmds(conn);
> > > -       isert_wait4logout(isert_conn);
> > > +       cancel_work_sync(&isert_conn->release_work);
> > > 
> > >        queue_work(isert_release_wq, &isert_conn->release_work);
> > > }
> > 
> > Sorry but leaving out the ib_drain_qp() and isert_wait*() calls seems wrong
> > to me. Additionally, resetting the send queue should not be needed since
> > the
> > iSER target driver should guarantee that no new WRs will be queued on the
> > send queue after isert_wait_conn() is called.
> > 
> > Seeing this patch makes me wonder whether this behavior can be reproduced
> > with any other HBA than ConnectX-4 Lx? Is this a software or a firmware
> > bug?
> > 
> > Thanks,
> > 
> > Bart.--
> > To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > 
> 
> Hello Bart
> 
> Its on my plate to try reproduce, I was not able to reproduce with my mlx4
> and IB as I was not in the office to pull cables and I am back-to-back.
> I also need to try Ethernet.
> 
> I hope to see if I can reproduce What Robert is seeing later this week as I
> am fully back in the office tomorrow.
> 
> If I can, I will try make some sense of this.
> 
> Thanks
> Laurence
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

Hello Robert
Are you reproducing again on ISER/IB using the cable pull method.
Thanks
Laurence
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Sagi Grimberg Jan. 23, 2017, 10:20 p.m. UTC | #5
Hi Robert,

Seeing this, makes me realize that the entire "iscsi_trx going
into D state" thread which I did not bother to read is actually an
iser-target related bug. I'm really sorry for not addressing this
sooner (much sooner).

The patch looks wrong to me, but lets try and talk about
the hang you are trying to solve.

> In certain circumstances the RDMA connection can be abruptly
> terminated,

Does it make a difference which port is causing the abruption?
Is it the target switch port? or the initiator switch port?

> but something is getting stuck preventing the iSCSI clean
> up commands from being completed.

I think this means, that at least one command was missing
a final kref_put and causing target_wait_for_sess_cmds() to
block forever.

> Just removing the isert_wait4*
> commands isn't enough.

Yes, because all the inflight IO needs to be properly cleaned up
for the session to terminate gracefully.

> Just resetting the queue pair isn't enough
> either.

This is true as well. Before tearing down the RDMA queue pair we
need to make sure we will never see a completion for it on its
corresponding completion queue. This is why ib_drain_qp exists.

> some help getting this patch fixed right as resetting the queue pair
> is probably not the right approach and overkill to solving the
> problem. I think it at least shows where the problem is occurring and
> how I can get around it.

As said, I have a feeling that we have a flow we are missing the last
kref_put on (at least) one of the session commands. The fact that this
involves port toggling, probably boils down to error completions.

Bart, I recall you had a patch at some point to periodically print
out the hanging session commands in target_wait_for_sess_cmds(), do we
want to get it in? I think we can all benefit from it.

Would it be possible to turn on isert debug_level=4 and send us the log?
$ echo 4 > /sys/module/ib_isert/parameters/debug_level

> The problem easily shows up with two ConnectX-4-LX card connected to a
> 10 Gb switch. The target is a RAM disk and the initiator just mounts
> it as ext4 and runs fio.

Can you please share the fio workload? Does this happen when for example
you run 100% read workload? or 100% write workload?

And, can you try and disable the unsolicited-data-out in the target
(IIRC its InitialR2T=Yes)? unsol dataout has been known to trigger
cause similar hangs before (which were supposed to be solved).

Also, can you please summarize what kernel versions do you see this
with? The previous thread is a bit hard to follow at once.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Bart Van Assche Jan. 23, 2017, 10:32 p.m. UTC | #6
On Tue, 2017-01-24 at 00:20 +0200, Sagi Grimberg wrote:
> Bart, I recall you had a patch at some point to periodically print
> out the hanging session commands in target_wait_for_sess_cmds(), do we
> want to get it in? I think we can all benefit from it.

(+target-devel)

Hello Sagi,

I have pushed the current state of these patches to the
scsi-target-for-v4.11 branch of
https://git.kernel.org/cgit/linux/kernel/git/bvanassche/linux.git/

So far these patches have been tested with iSCSI/TCP, ib_srpt and
qla2xxx drivers. The patches that are present on this branch are:

Bart Van Assche (36):
      qla2xxx: Avoid using variable-length arrays
      target/iscsi: Fix indentation in iscsi_target_start_negotiation()
      target/iscsi: Fix spelling of "perform"
      target/iscsi: Fix spelling of "reallegiance"
      target/iscsi: Introduce a helper function for TMF translation
      target/iscsi: Fix iSCSI task reassignment handling
      target/iscsi: Fix solicited data sequence offset calculations
      cxgbit: Fix endianness annotations
      target/tcm_fc: Remove a set-but-not-used variable
      target: Remove se_tmr_req.tmr_lun
      target: Make core_tmr_abort_task() consider all commands
      target: Make it possible to specify I_T nexus for SCSI abort
      tcm_qla2xxx: Let the target core look up the LUN of the aborted cmd
      target: Fix transport_wait_for_tasks() documentation
      target: Remove an overly chatty debug message
      target: Add a missing target_put_nacl() call
      target: Stop execution if CMD_T_STOP has been set
      target: Fix a reference leak in transport_cmd_check_stop_to_fabric()
      target: Inline transport_cmd_check_stop()
      target: Make ABORT and LUN RESET handling synchronous
      target: Simplify session shutdown code
      target: Remove the SCF_SEND_DELAYED_TAS command flag
      target: Inline transport_check_aborted_status()
      target: Remove the write_pending_status() callback function
      target: Remove several state tests from TMF code
      target: Remove command flag CMD_T_BUSY
      target: Simplify LUN RESET implementation
      target: Remove command flag CMD_T_DEV_ACTIVE
      target: Reduce number of __transport_wait_for_tasks() arguments
      target: Remove command flag CMD_T_TAS
      target: Remove unused arguments from __target_check_io_state()
      target: Change return type of transport_wait_for_tasks() into void
      target: Inline transport_put_cmd()
      target: Reduce the number of transport_lun_remove_cmd() callers
      target: Move target_remove_from_state_list() into target_release_cmd_kref()
      target: Avoid that XCOPY commands trigger a deadlock

Himanshu Madhani (2):
      qla2xxx: Remove SRR code
      qla2xxx: Remove unused reverse_ini_mode

Joe Carnuccio (1):
      qla2xxx: Simplify usage of SRB structure in driver

Quinn Tran (9):
      qla2xxx: Remove direct access of scsi_status field in se_cmd
      qla2xxx: Cleanup TMF code translation from qla_target
      qla2xxx: Make trace flags more readable
      qla2xxx: Fix wrong argument in sp done callback
      qla2xxx: Use d_id instead of s_id for more clarity
      qla2xxx: Track I-T nexus as single fc_port struct
      qla2xxx: Add framework for Async fabric discovery
      qla2xxx: Add Dual mode support in the driver
      qla2xxx: Improve RSCN handling in driver

Varun Prakash (1):
      target/iscsi: Fix unsolicited data seq_end_offset calculation

Bart.--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Robert LeBlanc Jan. 23, 2017, 10:42 p.m. UTC | #7
I still haven't been able to reproduce it on Infiniband like I was
able to a month ago. I'm still trying to figure out what I was doing
differently.
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Mon, Jan 23, 2017 at 2:41 PM, Laurence Oberman <loberman@redhat.com> wrote:
>
>
> ----- Original Message -----
>> From: "Laurence Oberman" <loberman@redhat.com>
>> To: "Bart Van Assche" <Bart.VanAssche@sandisk.com>
>> Cc: robert@leblancnet.us, linux-rdma@vger.kernel.org
>> Sent: Monday, January 23, 2017 3:20:31 PM
>> Subject: Re: [RFC] Clear out stuck ops to prevent iSER from going init D state
>>
>>
>>
>> ----- Original Message -----
>> > From: "Bart Van Assche" <Bart.VanAssche@sandisk.com>
>> > To: robert@leblancnet.us, linux-rdma@vger.kernel.org
>> > Sent: Monday, January 23, 2017 2:10:27 PM
>> > Subject: Re: [RFC] Clear out stuck ops to prevent iSER from going init D
>> > state
>> >
>> > On Mon, 2017-01-23 at 12:01 -0700, Robert LeBlanc wrote:
>> > > diff --git a/drivers/infiniband/core/verbs.c
>> > > b/drivers/infiniband/core/verbs.c
>> > > index 8368764..ed36748 100644
>> > > --- a/drivers/infiniband/core/verbs.c
>> > > +++ b/drivers/infiniband/core/verbs.c
>> > > @@ -2089,3 +2089,19 @@ void ib_drain_qp(struct ib_qp *qp)
>> > >                ib_drain_rq(qp);
>> > > }
>> > > EXPORT_SYMBOL(ib_drain_qp);
>> > > +
>> > > +void ib_reset_sq(struct ib_qp *qp)
>> > > +{
>> > > +       struct ib_qp_attr attr = { .qp_state = IB_QPS_RESET};
>> > > +       int ret;
>> > > +
>> > > +       ret = ib_modify_qp(qp, &attr, IB_QP_STATE);
>> > > +}
>> > > +EXPORT_SYMBOL(ib_reset_sq);
>> > > +
>> > > +void ib_reset_qp(struct ib_qp *qp)
>> > > +{
>> > > +       printk("ib_reset_qp calling ib_reset_sq.\n");
>> > > +       ib_reset_sq(qp);
>> > > +}
>> > > +EXPORT_SYMBOL(ib_reset_qp);
>> >
>> > These are one liners. Is it really worth to add one-line functions to the
>> > IB core?
>> >
>> > > diff --git a/drivers/infiniband/ulp/isert/ib_isert.c
>> > > b/drivers/infiniband/ulp/isert/ib_isert.c
>> > > index 6dd43f6..619dbc7 100644
>> > > --- a/drivers/infiniband/ulp/isert/ib_isert.c
>> > > +++ b/drivers/infiniband/ulp/isert/ib_isert.c
>> > > @@ -2595,10 +2595,9 @@ static void isert_wait_conn(struct iscsi_conn
>> > > *conn)
>> > >        isert_conn_terminate(isert_conn);
>> > >        mutex_unlock(&isert_conn->mutex);
>> > >
>> > > -       ib_drain_qp(isert_conn->qp);
>> > > +       ib_reset_qp(isert_conn->qp);
>> > >        isert_put_unsol_pending_cmds(conn);
>> > > -       isert_wait4cmds(conn);
>> > > -       isert_wait4logout(isert_conn);
>> > > +       cancel_work_sync(&isert_conn->release_work);
>> > >
>> > >        queue_work(isert_release_wq, &isert_conn->release_work);
>> > > }
>> >
>> > Sorry but leaving out the ib_drain_qp() and isert_wait*() calls seems wrong
>> > to me. Additionally, resetting the send queue should not be needed since
>> > the
>> > iSER target driver should guarantee that no new WRs will be queued on the
>> > send queue after isert_wait_conn() is called.
>> >
>> > Seeing this patch makes me wonder whether this behavior can be reproduced
>> > with any other HBA than ConnectX-4 Lx? Is this a software or a firmware
>> > bug?
>> >
>> > Thanks,
>> >
>> > Bart.--
>> > To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>> > the body of a message to majordomo@vger.kernel.org
>> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> >
>>
>> Hello Bart
>>
>> Its on my plate to try reproduce, I was not able to reproduce with my mlx4
>> and IB as I was not in the office to pull cables and I am back-to-back.
>> I also need to try Ethernet.
>>
>> I hope to see if I can reproduce What Robert is seeing later this week as I
>> am fully back in the office tomorrow.
>>
>> If I can, I will try make some sense of this.
>>
>> Thanks
>> Laurence
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>
> Hello Robert
> Are you reproducing again on ISER/IB using the cable pull method.
> Thanks
> Laurence
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Sagi Grimberg Jan. 23, 2017, 10:43 p.m. UTC | #8
>> Bart, I recall you had a patch at some point to periodically print
>> out the hanging session commands in target_wait_for_sess_cmds(), do we
>> want to get it in? I think we can all benefit from it.
>
> (+target-devel)
>
> Hello Sagi,

Hey Bart,

>
> I have pushed the current state of these patches to the
> scsi-target-for-v4.11 branch of
> https://git.kernel.org/cgit/linux/kernel/git/bvanassche/linux.git/

Awesome,

Robert, would it be possible to give the above tree:branch a try?

I'll be surprised if this doesn't reproduce here, but at least try
and get some valuable information.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Robert LeBlanc Jan. 23, 2017, 11:12 p.m. UTC | #9
On Mon, Jan 23, 2017 at 3:20 PM, Sagi Grimberg <sagi@grimberg.me> wrote:
> Hi Robert,
>
> Seeing this, makes me realize that the entire "iscsi_trx going
> into D state" thread which I did not bother to read is actually an
> iser-target related bug. I'm really sorry for not addressing this
> sooner (much sooner).

I was hesitant to start a new thread and fragment the discussion and
cause confusion. I thought by presenting my findings as an RFC with
some code, I might get some new ideas.

> The patch looks wrong to me, but lets try and talk about
> the hang you are trying to solve.
>
>> In certain circumstances the RDMA connection can be abruptly
>> terminated,
>
>
> Does it make a difference which port is causing the abruption?
> Is it the target switch port? or the initiator switch port?

It seems more tied to initiator port, but previously with our target
export scripts it also seems that a target crash and re-export
(generally out of order) could also cause the issue. My theory is that
the session didn't mach enough and because of that would not get torn
down completely. It is more of speculation from observations and
thought than anything concrete.

>> but something is getting stuck preventing the iSCSI clean
>> up commands from being completed.
>
>
> I think this means, that at least one command was missing
> a final kref_put and causing target_wait_for_sess_cmds() to
> block forever.
>
>> Just removing the isert_wait4*
>> commands isn't enough.
>
>
> Yes, because all the inflight IO needs to be properly cleaned up
> for the session to terminate gracefully.
>
>> Just resetting the queue pair isn't enough
>> either.
>
>
> This is true as well. Before tearing down the RDMA queue pair we
> need to make sure we will never see a completion for it on its
> corresponding completion queue. This is why ib_drain_qp exists.
>
>> some help getting this patch fixed right as resetting the queue pair
>> is probably not the right approach and overkill to solving the
>> problem. I think it at least shows where the problem is occurring and
>> how I can get around it.
>
>
> As said, I have a feeling that we have a flow we are missing the last
> kref_put on (at least) one of the session commands. The fact that this
> involves port toggling, probably boils down to error completions.
>
> Bart, I recall you had a patch at some point to periodically print
> out the hanging session commands in target_wait_for_sess_cmds(), do we
> want to get it in? I think we can all benefit from it.

Do you have a link to the path and I can add it in?

> Would it be possible to turn on isert debug_level=4 and send us the log?
> $ echo 4 > /sys/module/ib_isert/parameters/debug_level

I'll get that to you tomorrow.

>> The problem easily shows up with two ConnectX-4-LX card connected to a
>> 10 Gb switch. The target is a RAM disk and the initiator just mounts
>> it as ext4 and runs fio.
>
>
> Can you please share the fio workload? Does this happen when for example
> you run 100% read workload? or 100% write workload?

echo "3" > /proc/sys/vm/drop_caches; fio --rw=read --bs=4K --size=1G
--numjobs=40 --name=worker.matt --group_reporting

With the ConnectX-4-LX cards, it will usually cause the issue while
laying out the first four files, usually on the first one or two. When
I was able to replicate it on IB, I believe I was able to replicate by
pulling the cable on lay out or during the read portion of the test.
So I don't think it matters what the workload is, but I think it needs
a workload of some sort.

> And, can you try and disable the unsolicited-data-out in the target
> (IIRC its InitialR2T=Yes)? unsol dataout has been known to trigger
> cause similar hangs before (which were supposed to be solved).

I'll figure out how to do this and test it tomorrow and let you know
the results.

> Also, can you please summarize what kernel versions do you see this
> with? The previous thread is a bit hard to follow at once.

4.9, 4.4.x, 4.1.x (we believe we saw it here, but it has been a long
time since we have run this version.)

Thanks for the response Sagi, I'm happy to have new things to try, I
was really lost at where to go next.

----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Robert LeBlanc Jan. 23, 2017, 11:15 p.m. UTC | #10
Yeah, I'll checkout this branch and try it.
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Mon, Jan 23, 2017 at 3:43 PM, Sagi Grimberg <sagi@grimberg.me> wrote:
>
>>> Bart, I recall you had a patch at some point to periodically print
>>> out the hanging session commands in target_wait_for_sess_cmds(), do we
>>> want to get it in? I think we can all benefit from it.
>>
>>
>> (+target-devel)
>>
>> Hello Sagi,
>
>
> Hey Bart,
>
>>
>> I have pushed the current state of these patches to the
>> scsi-target-for-v4.11 branch of
>> https://git.kernel.org/cgit/linux/kernel/git/bvanassche/linux.git/
>
>
> Awesome,
>
> Robert, would it be possible to give the above tree:branch a try?
>
> I'll be surprised if this doesn't reproduce here, but at least try
> and get some valuable information.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Max Gurtovoy Jan. 24, 2017, 10:37 a.m. UTC | #11
Robert,
can you please try replacing the ib_drain_qp with ib_drain_rq ? or make 
sure ib_drain_qp don't stuck.
in the past (before v4.6) we sent "recv_beacon" on the isert disconnect 
flow.
I suspect that area.

Max.

On 1/24/2017 1:12 AM, Robert LeBlanc wrote:
> On Mon, Jan 23, 2017 at 3:20 PM, Sagi Grimberg <sagi@grimberg.me> wrote:
>> Hi Robert,
>>
>> Seeing this, makes me realize that the entire "iscsi_trx going
>> into D state" thread which I did not bother to read is actually an
>> iser-target related bug. I'm really sorry for not addressing this
>> sooner (much sooner).
>
> I was hesitant to start a new thread and fragment the discussion and
> cause confusion. I thought by presenting my findings as an RFC with
> some code, I might get some new ideas.
>
>> The patch looks wrong to me, but lets try and talk about
>> the hang you are trying to solve.
>>
>>> In certain circumstances the RDMA connection can be abruptly
>>> terminated,
>>
>>
>> Does it make a difference which port is causing the abruption?
>> Is it the target switch port? or the initiator switch port?
>
> It seems more tied to initiator port, but previously with our target
> export scripts it also seems that a target crash and re-export
> (generally out of order) could also cause the issue. My theory is that
> the session didn't mach enough and because of that would not get torn
> down completely. It is more of speculation from observations and
> thought than anything concrete.
>
>>> but something is getting stuck preventing the iSCSI clean
>>> up commands from being completed.
>>
>>
>> I think this means, that at least one command was missing
>> a final kref_put and causing target_wait_for_sess_cmds() to
>> block forever.
>>
>>> Just removing the isert_wait4*
>>> commands isn't enough.
>>
>>
>> Yes, because all the inflight IO needs to be properly cleaned up
>> for the session to terminate gracefully.
>>
>>> Just resetting the queue pair isn't enough
>>> either.
>>
>>
>> This is true as well. Before tearing down the RDMA queue pair we
>> need to make sure we will never see a completion for it on its
>> corresponding completion queue. This is why ib_drain_qp exists.
>>
>>> some help getting this patch fixed right as resetting the queue pair
>>> is probably not the right approach and overkill to solving the
>>> problem. I think it at least shows where the problem is occurring and
>>> how I can get around it.
>>
>>
>> As said, I have a feeling that we have a flow we are missing the last
>> kref_put on (at least) one of the session commands. The fact that this
>> involves port toggling, probably boils down to error completions.
>>
>> Bart, I recall you had a patch at some point to periodically print
>> out the hanging session commands in target_wait_for_sess_cmds(), do we
>> want to get it in? I think we can all benefit from it.
>
> Do you have a link to the path and I can add it in?
>
>> Would it be possible to turn on isert debug_level=4 and send us the log?
>> $ echo 4 > /sys/module/ib_isert/parameters/debug_level
>
> I'll get that to you tomorrow.
>
>>> The problem easily shows up with two ConnectX-4-LX card connected to a
>>> 10 Gb switch. The target is a RAM disk and the initiator just mounts
>>> it as ext4 and runs fio.
>>
>>
>> Can you please share the fio workload? Does this happen when for example
>> you run 100% read workload? or 100% write workload?
>
> echo "3" > /proc/sys/vm/drop_caches; fio --rw=read --bs=4K --size=1G
> --numjobs=40 --name=worker.matt --group_reporting
>
> With the ConnectX-4-LX cards, it will usually cause the issue while
> laying out the first four files, usually on the first one or two. When
> I was able to replicate it on IB, I believe I was able to replicate by
> pulling the cable on lay out or during the read portion of the test.
> So I don't think it matters what the workload is, but I think it needs
> a workload of some sort.
>
>> And, can you try and disable the unsolicited-data-out in the target
>> (IIRC its InitialR2T=Yes)? unsol dataout has been known to trigger
>> cause similar hangs before (which were supposed to be solved).
>
> I'll figure out how to do this and test it tomorrow and let you know
> the results.
>
>> Also, can you please summarize what kernel versions do you see this
>> with? The previous thread is a bit hard to follow at once.
>
> 4.9, 4.4.x, 4.1.x (we believe we saw it here, but it has been a long
> time since we have run this version.)
>
> Thanks for the response Sagi, I'm happy to have new things to try, I
> was really lost at where to go next.
>
> ----------------
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Robert LeBlanc Jan. 24, 2017, 9:49 p.m. UTC | #12
On Mon, Jan 23, 2017 at 4:12 PM, Robert LeBlanc <robert@leblancnet.us> wrote:
> On Mon, Jan 23, 2017 at 3:20 PM, Sagi Grimberg <sagi@grimberg.me> wrote:
>> Hi Robert,
>>
>> Seeing this, makes me realize that the entire "iscsi_trx going
>> into D state" thread which I did not bother to read is actually an
>> iser-target related bug. I'm really sorry for not addressing this
>> sooner (much sooner).
>
> I was hesitant to start a new thread and fragment the discussion and
> cause confusion. I thought by presenting my findings as an RFC with
> some code, I might get some new ideas.
>
>> The patch looks wrong to me, but lets try and talk about
>> the hang you are trying to solve.
>>
>>> In certain circumstances the RDMA connection can be abruptly
>>> terminated,
>>
>>
>> Does it make a difference which port is causing the abruption?
>> Is it the target switch port? or the initiator switch port?
>
> It seems more tied to initiator port, but previously with our target
> export scripts it also seems that a target crash and re-export
> (generally out of order) could also cause the issue. My theory is that
> the session didn't mach enough and because of that would not get torn
> down completely. It is more of speculation from observations and
> thought than anything concrete.
>
>>> but something is getting stuck preventing the iSCSI clean
>>> up commands from being completed.
>>
>>
>> I think this means, that at least one command was missing
>> a final kref_put and causing target_wait_for_sess_cmds() to
>> block forever.
>>
>>> Just removing the isert_wait4*
>>> commands isn't enough.
>>
>>
>> Yes, because all the inflight IO needs to be properly cleaned up
>> for the session to terminate gracefully.
>>
>>> Just resetting the queue pair isn't enough
>>> either.
>>
>>
>> This is true as well. Before tearing down the RDMA queue pair we
>> need to make sure we will never see a completion for it on its
>> corresponding completion queue. This is why ib_drain_qp exists.
>>
>>> some help getting this patch fixed right as resetting the queue pair
>>> is probably not the right approach and overkill to solving the
>>> problem. I think it at least shows where the problem is occurring and
>>> how I can get around it.
>>
>>
>> As said, I have a feeling that we have a flow we are missing the last
>> kref_put on (at least) one of the session commands. The fact that this
>> involves port toggling, probably boils down to error completions.
>>
>> Bart, I recall you had a patch at some point to periodically print
>> out the hanging session commands in target_wait_for_sess_cmds(), do we
>> want to get it in? I think we can all benefit from it.
>
> Do you have a link to the path and I can add it in?
>
>> Would it be possible to turn on isert debug_level=4 and send us the log?
>> $ echo 4 > /sys/module/ib_isert/parameters/debug_level
>
> I'll get that to you tomorrow.

This is using the scsi-target-for-v4.11 branch and debug_level set to 4:

[ 7181.311934] isert: isert_handle_scsi_cmd: Transfer Immediate
imm_data_len: 4096
[ 7181.311939] isert: isert_create_send_desc: tx_desc ffff88653e371ad8
lkey mismatch, fixing
[ 7181.311940] isert: isert_init_tx_hdrs: Setup tx_sg[0].addr:
0x653e371ad8 length: 76 lkey: 0x4260
[ 7181.311940] isert: isert_put_response: Posting SCSI Response
[ 7181.311941] isert: isert_create_send_desc: tx_desc ffff88653e3586c0
lkey mismatch, fixing
[ 7181.311942] isert: isert_init_tx_hdrs: Setup tx_sg[0].addr:
0x653e3586c0 length: 76 lkey: 0x4260
[ 7181.311943] isert: isert_put_response: Posting SCSI Response
[ 7181.311944] isert: isert_create_send_desc: tx_desc ffff88653e365fe0
lkey mismatch, fixing
[ 7181.311944] isert: isert_init_tx_hdrs: Setup tx_sg[0].addr:
0x653e365fe0 length: 76 lkey: 0x4260
[ 7181.311945] isert: isert_put_response: Posting SCSI Response
[ 7181.311945] isert: isert_create_send_desc: tx_desc ffff88653e36a238
lkey mismatch, fixing
[ 7181.311962] isert: isert_init_tx_hdrs: Setup tx_sg[0].addr:
0x653e36a238 length: 76 lkey: 0x4260
[ 7181.311963] isert: isert_put_response: Posting SCSI Response
[ 7181.311964] isert: isert_create_send_desc: tx_desc ffff88653e3671f8
lkey mismatch, fixing
[ 7181.311964] isert: isert_init_tx_hdrs: Setup tx_sg[0].addr:
0x653e3671f8 length: 76 lkey: 0x4260
[ 7181.311964] isert: isert_put_response: Posting SCSI Response
[ 7181.311965] isert: isert_create_send_desc: tx_desc ffff88653e348970
lkey mismatch, fixing
[ 7181.311966] isert: isert_init_tx_hdrs: Setup tx_sg[0].addr:
0x653e348970 length: 76 lkey: 0x4260
[ 7181.311966] isert: isert_put_response: Posting SCSI Response
[ 7181.311967] isert: isert_create_send_desc: tx_desc ffff88653e36cc70
lkey mismatch, fixing
[ 7181.311968] isert: isert_init_tx_hdrs: Setup tx_sg[0].addr:
0x653e36cc70 length: 76 lkey: 0x4260
[ 7181.311968] isert: isert_put_response: Posting SCSI Response
[ 7181.311969] isert: isert_create_send_desc: tx_desc ffff88653e362390
lkey mismatch, fixing
[ 7181.311969] isert: isert_init_tx_hdrs: Setup tx_sg[0].addr:
0x653e362390 length: 76 lkey: 0x4260
[ 7181.311969] isert: isert_put_response: Posting SCSI Response
[ 7181.311984] isert: isert_recv_done: DMA: 0x7eaa406000, iSCSI
opcode: 0x01, ITT: 0x0000013b, flags: 0xa1 dlen: 4096
[ 7181.311985] isert: isert_create_send_desc: tx_desc ffff88653e3428f0
lkey mismatch, fixing
[ 7181.311986] isert: isert_init_tx_hdrs: Setup tx_sg[0].addr:
0x653e3428f0 length: 76 lkey: 0x4260
[ 7181.311987] isert: isert_put_response: Posting SCSI Response
[ 7181.311988] isert: isert_recv_done: ISER ISCSI_CTRL PDU
[ 7181.311989] isert: isert_create_send_desc: tx_desc ffff88653e367800
lkey mismatch, fixing
[ 7181.311989] isert: isert_init_tx_hdrs: Setup tx_sg[0].addr:
0x653e367800 length: 76 lkey: 0x4260
[ 7181.311990] isert: isert_put_response: Posting SCSI Response
[ 7181.311991] isert: isert_create_send_desc: tx_desc ffff88653e35cf20
lkey mismatch, fixing
[ 7181.311991] isert: isert_handle_scsi_cmd: Transfer Immediate
imm_data_len: 4096
[ 7181.311992] isert: isert_init_tx_hdrs: Setup tx_sg[0].addr:
0x653e35cf20 length: 76 lkey: 0x4260
[ 7181.311993] isert: isert_put_response: Posting SCSI Response
[ 7181.311994] isert: isert_create_send_desc: tx_desc ffff88653e370ec8
lkey mismatch, fixing
[ 7181.311994] isert: isert_init_tx_hdrs: Setup tx_sg[0].addr:
0x653e370ec8 length: 76 lkey: 0x4260
[ 7181.311994] isert: isert_put_response: Posting SCSI Response
[ 7181.311996] isert: isert_recv_done: DMA: 0x7eaa49f000, iSCSI
opcode: 0x01, ITT: 0x0000013e, flags: 0xa1 dlen: 4096
[ 7181.311997] isert: isert_recv_done: ISER ISCSI_CTRL PDU
[ 7181.311998] isert: isert_handle_scsi_cmd: Transfer Immediate
imm_data_len: 4096
[ 7181.312000] isert: isert_recv_done: DMA: 0x7eaa424000, iSCSI
opcode: 0x01, ITT: 0x00000171, flags: 0xa1 dlen: 4096
[ 7181.312000] isert: isert_recv_done: ISER ISCSI_CTRL PDU
[ 7181.312002] isert: isert_handle_scsi_cmd: Transfer Immediate
imm_data_len: 4096
[ 7181.312004] isert: isert_recv_done: DMA: 0x7eaa4d2000, iSCSI
opcode: 0x01, ITT: 0x00000043, flags: 0xa1 dlen: 4096
[ 7181.312004] isert: isert_recv_done: ISER ISCSI_CTRL PDU
[ 7181.312005] isert: isert_handle_scsi_cmd: Transfer Immediate
imm_data_len: 4096
[ 7181.312007] isert: isert_recv_done: DMA: 0x7eaa57a000, iSCSI
opcode: 0x01, ITT: 0x000001ab, flags: 0xa1 dlen: 4096
[ 7181.312007] isert: isert_recv_done: ISER ISCSI_CTRL PDU
[ 7181.312009] isert: isert_create_send_desc: tx_desc ffff88653e34dde0
lkey mismatch, fixing
[ 7181.312010] isert: isert_init_tx_hdrs: Setup tx_sg[0].addr:
0x653e34dde0 length: 76 lkey: 0x4260
[ 7181.312011] isert: isert_put_response: Posting SCSI Response
[ 7181.312012] isert: isert_create_send_desc: tx_desc ffff88653e36d278
lkey mismatch, fixing
[ 7181.312013] isert: isert_init_tx_hdrs: Setup tx_sg[0].addr:
0x653e36d278 length: 76 lkey: 0x4260
[ 7181.312013] isert: isert_put_response: Posting SCSI Response
[ 7181.312014] isert: isert_create_send_desc: tx_desc ffff88653e36a840
lkey mismatch, fixing
[ 7181.312015] isert: isert_init_tx_hdrs: Setup tx_sg[0].addr:
0x653e36a840 length: 76 lkey: 0x4260
[ 7181.312015] isert: isert_put_response: Posting SCSI Response
[ 7181.312016] isert: isert_create_send_desc: tx_desc ffff88653e36fcb0
lkey mismatch, fixing
[ 7181.312017] isert: isert_init_tx_hdrs: Setup tx_sg[0].addr:
0x653e36fcb0 length: 76 lkey: 0x4260
[ 7181.312017] isert: isert_put_response: Posting SCSI Response
[ 7181.312033] systemd-journald[1358]: /dev/kmsg buffer overrun, some
messages lost.
[ 7191.370368] isert: isert_cma_handler: disconnected (10): status 0
id ffff88653e684400 np ffff887f1bed5d80
[ 7191.380635] isert: isert_wait_conn: Starting conn ffff887f25f2e000
[ 7191.387667] isert: isert_conn_terminate: Terminating conn
ffff887f25f2e000 state 3
[ 7193.456737] isert: isert_cma_handler: timewait exit (15): status 0
id ffff88653e684400 np ffff887f1bed5d80
[ 7193.660233] isert: isert_cma_handler: connect request (4): status 0
id ffff88653e595400 np ffff887f1bed5d80
[ 7193.660235] isert: isert_connect_request: cma_id: ffff88653e595400,
portal: ffff887f1bed5d80
[ 7193.660258] isert: isert_device_get: Found iser device
ffff883f590c6ac0 refcount 2
[ 7193.660258] isert: isert_set_nego_params: Using initiator_depth: 16
[ 7193.660259] isert: isert_set_nego_params: Using remote invalidation
[ 7193.660261] isert: isert_comp_get: conn ffff887f25f2c000, using
comp ffff88653e684818 min_index: 1
[ 7193.661196] isert: isert_login_post_recv: Setup sge: addr:
653e5fc000 length: 8268 0x00004260
[ 7193.662041] isert: isert_cma_handler: established (9): status 0 id
ffff88653e595400 np ffff887f1bed5d80
[ 7193.662043] isert: isert_connected_handler: conn ffff887f25f2c000
[ 7193.662067] isert: isert_connected_handler: np ffff887f1bed5d80:
Allow accept_np to continue
[ 7193.662085] isert: isert_accept_np: Processing isert_conn: ffff887f25f2c000
[ 7193.662101] isert: isert_get_login_rx: before login_req comp conn:
ffff887f25f2c000
[ 7193.929623] isert: isert_rx_login_req: conn ffff887f25f2c000
[ 7193.935895] isert: isert_rx_login_req: Using login payload size:
504, rx_buflen: 504 MAX_KEY_VALUE_PAIRS: 8192
[ 7193.935895] isert: isert_get_login_rx: before login_comp conn:
ffff887f25a99800
[ 7193.943843] isert: isert_get_login_rx: processing login->req:
ffff887ea2962ce4
[ 7193.951790] isert: isert_get_sup_prot_ops: conn ffff887f25f2c000 PI
offload disabled
[ 7208.696191] isert: isert_cma_handler: disconnected (10): status 0
id ffff88653e595400 np ffff887f1bed5d80
[ 7209.297524] iSCSI Login timeout on Network Portal 0.0.0.0:3260
[ 7210.949920] isert: isert_cma_handler: connect request (4): status 0
id ffff88653e594800 np ffff887f1bed5d80
[ 7210.960293] isert: isert_connect_request: cma_id: ffff88653e594800,
portal: ffff887f1bed5d80
[ 7210.960307] isert: isert_device_get: Found iser device
ffff883f590c6ac0 refcount 3
[ 7210.968496] isert: isert_set_nego_params: Using initiator_depth: 16
[ 7210.968496] isert: isert_set_nego_params: Using remote invalidation
[ 7210.975350] isert: isert_comp_get: conn ffff887f25f2a000, using
comp ffff88653e684830 min_index: 2
[ 7210.985314] isert: isert_login_post_recv: Setup sge: addr:
7eaa7b0000 length: 8268 0x00004260
[ 7210.986162] isert: isert_cma_handler: established (9): status 0 id
ffff88653e594800 np ffff887f1bed5d80
[ 7210.996156] isert: isert_connected_handler: conn ffff887f25f2a000
[ 7211.002839] isert: isert_connected_handler: np ffff887f1bed5d80:
Allow accept_np to continue
[ 7226.017139] isert: isert_cma_handler: disconnected (10): status 0
id ffff88653e594800 np ffff887f1bed5d80
[ 7226.027390] isert: isert_conn_terminate: Terminating conn
ffff887f25f2a000 state 1
[ 7226.036933] isert: isert_release_work: Starting release conn ffff887f25f2a000
[ 7226.044659] isert: isert_release_work: Destroying conn ffff887f25f2a000
[ 7226.051852] isert: isert_release_kref: conn ffff887f25f2a000 final
kref kworker/u66:1/310
[ 7226.060626] isert: isert_connect_release: conn ffff887f25f2a000
[ 7226.061592] isert: isert_device_put: device ffff883f590c6ac0 refcount 2
[ 7228.270580] isert: isert_cma_handler: connect request (4): status 0
id ffff88653e594c00 np ffff887f1bed5d80
[ 7228.280959] isert: isert_connect_request: cma_id: ffff88653e594c00,
portal: ffff887f1bed5d80
[ 7228.280972] isert: isert_device_get: Found iser device
ffff883f590c6ac0 refcount 3
[ 7228.289173] isert: isert_set_nego_params: Using initiator_depth: 16
[ 7228.289174] isert: isert_set_nego_params: Using remote invalidation
[ 7228.296030] isert: isert_comp_get: conn ffff887f25f2a000, using
comp ffff88653e684830 min_index: 2
[ 7228.306105] isert: isert_login_post_recv: Setup sge: addr:
653d538000 length: 8268 0x00004260
[ 7228.306862] isert: isert_cma_handler: established (9): status 0 id
ffff88653e594c00 np ffff887f1bed5d80
[ 7228.316851] isert: isert_connected_handler: conn ffff887f25f2a000
[ 7228.323580] isert: isert_connected_handler: np ffff887f1bed5d80:
Allow accept_np to continue
[ 7243.336981] isert: isert_cma_handler: disconnected (10): status 0
id ffff88653e594c00 np ffff887f1bed5d80
[ 7243.347320] isert: isert_conn_terminate: Terminating conn
ffff887f25f2a000 state 1
[ 7243.356878] isert: isert_release_work: Starting release conn ffff887f25f2a000
[ 7243.364667] isert: isert_release_work: Destroying conn ffff887f25f2a000
[ 7243.371927] isert: isert_release_kref: conn ffff887f25f2a000 final
kref kworker/u66:2/22941
[ 7243.380940] isert: isert_connect_release: conn ffff887f25f2a000
[ 7243.381748] isert: isert_device_put: device ffff883f590c6ac0 refcount 2
[ 7245.590796] isert: isert_cma_handler: connect request (4): status 0
id ffff88653e597800 np ffff887f1bed5d80
[ 7245.601233] isert: isert_connect_request: cma_id: ffff88653e597800,
portal: ffff887f1bed5d80
[ 7245.601247] isert: isert_device_get: Found iser device
ffff883f590c6ac0 refcount 3
[ 7245.609505] isert: isert_set_nego_params: Using initiator_depth: 16
[ 7245.609505] isert: isert_set_nego_params: Using remote invalidation
[ 7245.616450] isert: isert_comp_get: conn ffff887f25f2a000, using
comp ffff88653e684830 min_index: 2
[ 7245.626606] isert: isert_login_post_recv: Setup sge: addr:
653d7f4000 length: 8268 0x00004260
[ 7245.627327] isert: isert_cma_handler: established (9): status 0 id
ffff88653e597800 np ffff887f1bed5d80
[ 7245.637403] isert: isert_connected_handler: conn ffff887f25f2a000
[ 7245.644168] isert: isert_connected_handler: np ffff887f1bed5d80:
Allow accept_np to continue
[ 7260.657275] isert: isert_cma_handler: disconnected (10): status 0
id ffff88653e597800 np ffff887f1bed5d80
[ 7260.667577] isert: isert_conn_terminate: Terminating conn
ffff887f25f2a000 state 1
[ 7260.677085] isert: isert_release_work: Starting release conn ffff887f25f2a000
[ 7260.684908] isert: isert_release_work: Destroying conn ffff887f25f2a000
[ 7260.692193] isert: isert_release_kref: conn ffff887f25f2a000 final
kref kworker/u66:2/22941
[ 7260.701219] isert: isert_connect_release: conn ffff887f25f2a000
[ 7260.702013] isert: isert_device_put: device ffff883f590c6ac0 refcount 2
[ 7262.911031] isert: isert_cma_handler: connect request (4): status 0
id ffff88653e592400 np ffff887f1bed5d80
[ 7262.921456] isert: isert_connect_request: cma_id: ffff88653e592400,
portal: ffff887f1bed5d80
[ 7262.921461] isert: isert_device_get: Found iser device
ffff883f590c6ac0 refcount 3
[ 7262.929699] isert: isert_set_nego_params: Using initiator_depth: 16
[ 7262.929699] isert: isert_set_nego_params: Using remote invalidation
[ 7262.936642] isert: isert_comp_get: conn ffff887f25f2a000, using
comp ffff88653e684830 min_index: 2
[ 7262.946822] isert: isert_login_post_recv: Setup sge: addr:
653d7f4000 length: 8268 0x00004260
[ 7262.947545] isert: isert_cma_handler: established (9): status 0 id
ffff88653e592400 np ffff887f1bed5d80
[ 7262.957632] isert: isert_connected_handler: conn ffff887f25f2a000
[ 7262.964418] isert: isert_connected_handler: np ffff887f1bed5d80:
Allow accept_np to continue
[ 7277.977210] isert: isert_cma_handler: disconnected (10): status 0
id ffff88653e592400 np ffff887f1bed5d80
[ 7277.987624] isert: isert_conn_terminate: Terminating conn
ffff887f25f2a000 state 1
[ 7277.996977] isert: isert_release_work: Starting release conn ffff887f25f2a000
[ 7278.004803] isert: isert_release_work: Destroying conn ffff887f25f2a000
[ 7278.012103] isert: isert_release_kref: conn ffff887f25f2a000 final
kref kworker/u66:1/310
[ 7278.020951] isert: isert_connect_release: conn ffff887f25f2a000
[ 7278.022070] isert: isert_device_put: device ffff883f590c6ac0 refcount 2
[ 7280.230596] isert: isert_cma_handler: connect request (4): status 0
id ffff88653e594000 np ffff887f1bed5d80
[ 7280.241093] isert: isert_connect_request: cma_id: ffff88653e594000,
portal: ffff887f1bed5d80
[ 7280.241106] isert: isert_device_get: Found iser device
ffff883f590c6ac0 refcount 3
[ 7280.249396] isert: isert_set_nego_params: Using initiator_depth: 16
[ 7280.249396] isert: isert_set_nego_params: Using remote invalidation
[ 7280.256387] isert: isert_comp_get: conn ffff887f25f2a000, using
comp ffff88653e684830 min_index: 2
[ 7280.266607] isert: isert_login_post_recv: Setup sge: addr:
653d7f4000 length: 8268 0x00004260
[ 7280.267304] isert: isert_cma_handler: established (9): status 0 id
ffff88653e594000 np ffff887f1bed5d80
[ 7280.277408] isert: isert_connected_handler: conn ffff887f25f2a000
[ 7280.284172] isert: isert_connected_handler: np ffff887f1bed5d80:
Allow accept_np to continue
[ 7295.297259] isert: isert_cma_handler: disconnected (10): status 0
id ffff88653e594000 np ffff887f1bed5d80
[ 7295.307589] isert: isert_conn_terminate: Terminating conn
ffff887f25f2a000 state 1
[ 7295.317063] isert: isert_release_work: Starting release conn ffff887f25f2a000
[ 7295.324897] isert: isert_release_work: Destroying conn ffff887f25f2a000
[ 7295.332190] isert: isert_release_kref: conn ffff887f25f2a000 final
kref kworker/u66:1/310
[ 7295.341026] isert: isert_connect_release: conn ffff887f25f2a000
[ 7295.341884] isert: isert_device_put: device ffff883f590c6ac0 refcount 2
[ 7297.550811] isert: isert_cma_handler: connect request (4): status 0
id ffff88653e594000 np ffff887f1bed5d80
[ 7297.561237] isert: isert_connect_request: cma_id: ffff88653e594000,
portal: ffff887f1bed5d80
[ 7297.561243] isert: isert_device_get: Found iser device
ffff883f590c6ac0 refcount 3
[ 7297.569490] isert: isert_set_nego_params: Using initiator_depth: 16
[ 7297.569491] isert: isert_set_nego_params: Using remote invalidation
[ 7297.576443] isert: isert_comp_get: conn ffff887f25f2a000, using
comp ffff88653e684830 min_index: 2
[ 7297.586594] isert: isert_login_post_recv: Setup sge: addr:
653d7f4000 length: 8268 0x00004260
[ 7297.587294] isert: isert_cma_handler: established (9): status 0 id
ffff88653e594000 np ffff887f1bed5d80
[ 7297.597371] isert: isert_connected_handler: conn ffff887f25f2a000
[ 7297.604160] isert: isert_connected_handler: np ffff887f1bed5d80:
Allow accept_np to continue
[ 7312.617204] isert: isert_cma_handler: disconnected (10): status 0
id ffff88653e594000 np ffff887f1bed5d80
[ 7312.627513] isert: isert_conn_terminate: Terminating conn
ffff887f25f2a000 state 1
[ 7312.636920] isert: isert_release_work: Starting release conn ffff887f25f2a000
[ 7312.644733] isert: isert_release_work: Destroying conn ffff887f25f2a000
[ 7312.652031] isert: isert_release_kref: conn ffff887f25f2a000 final
kref kworker/u66:1/310
[ 7312.660902] isert: isert_connect_release: conn ffff887f25f2a000
[ 7312.661725] isert: isert_device_put: device ffff883f590c6ac0 refcount 2
[ 7315.448740] isert: isert_cma_handler: connect request (4): status 0
id ffff88653e592400 np ffff887f1bed5d80
[ 7315.459216] isert: isert_connect_request: cma_id: ffff88653e592400,
portal: ffff887f1bed5d80
[ 7315.459222] isert: isert_device_get: Found iser device
ffff883f590c6ac0 refcount 3
[ 7315.467509] isert: isert_set_nego_params: Using initiator_depth: 16
[ 7315.467509] isert: isert_set_nego_params: Using remote invalidation
[ 7315.474494] isert: isert_comp_get: conn ffff887f25f2a000, using
comp ffff88653e684830 min_index: 2
[ 7315.484601] isert: isert_login_post_recv: Setup sge: addr:
653d7f4000 length: 8268 0x00004260
[ 7315.485289] isert: isert_cma_handler: established (9): status 0 id
ffff88653e592400 np ffff887f1bed5d80
[ 7315.495398] isert: isert_connected_handler: conn ffff887f25f2a000
[ 7315.502218] isert: isert_connected_handler: np ffff887f1bed5d80:
Allow accept_np to continue
[ 7330.485314] isert: isert_cma_handler: disconnected (10): status 0
id ffff88653e592400 np ffff887f1bed5d80
[ 7330.495629] isert: isert_conn_terminate: Terminating conn
ffff887f25f2a000 state 1
[ 7330.504877] isert: isert_release_work: Starting release conn ffff887f25f2a000
[ 7330.512694] isert: isert_release_work: Destroying conn ffff887f25f2a000
[ 7330.519994] isert: isert_release_kref: conn ffff887f25f2a000 final
kref kworker/u66:2/22941
[ 7330.529032] isert: isert_connect_release: conn ffff887f25f2a000
[ 7330.529990] isert: isert_device_put: device ffff883f590c6ac0 refcount 2
[ 7332.782764] isert: isert_cma_handler: connect request (4): status 0
id ffff88653e597800 np ffff887f1bed5d80
[ 7332.793278] isert: isert_connect_request: cma_id: ffff88653e597800,
portal: ffff887f1bed5d80
[ 7332.793284] isert: isert_device_get: Found iser device
ffff883f590c6ac0 refcount 3
[ 7332.801642] isert: isert_set_nego_params: Using initiator_depth: 16
[ 7332.801656] isert: isert_set_nego_params: Using remote invalidation
[ 7332.808665] isert: isert_comp_get: conn ffff887f25f2a000, using
comp ffff88653e684830 min_index: 2
[ 7332.818813] isert: isert_login_post_recv: Setup sge: addr:
653d7f4000 length: 8268 0x00004260
[ 7332.819745] isert: isert_cma_handler: established (9): status 0 id
ffff88653e597800 np ffff887f1bed5d80
[ 7332.829836] isert: isert_connected_handler: conn ffff887f25f2a000
[ 7332.836621] isert: isert_connected_handler: np ffff887f1bed5d80:
Allow accept_np to continue

>>> The problem easily shows up with two ConnectX-4-LX card connected to a
>>> 10 Gb switch. The target is a RAM disk and the initiator just mounts
>>> it as ext4 and runs fio.
>>
>>
>> Can you please share the fio workload? Does this happen when for example
>> you run 100% read workload? or 100% write workload?
>
> echo "3" > /proc/sys/vm/drop_caches; fio --rw=read --bs=4K --size=1G
> --numjobs=40 --name=worker.matt --group_reporting
>
> With the ConnectX-4-LX cards, it will usually cause the issue while
> laying out the first four files, usually on the first one or two. When
> I was able to replicate it on IB, I believe I was able to replicate by
> pulling the cable on lay out or during the read portion of the test.
> So I don't think it matters what the workload is, but I think it needs
> a workload of some sort.
>
>> And, can you try and disable the unsolicited-data-out in the target
>> (IIRC its InitialR2T=Yes)? unsol dataout has been known to trigger
>> cause similar hangs before (which were supposed to be solved).
>
> I'll figure out how to do this and test it tomorrow and let you know
> the results.

Same branch and debug as above, but with targetcli
/iscsi/iqn.2016-12.com.betterservers/tpg1 set parameter InitialR2T=No

The test completed the first time, I reformatted and reran the test
(without messing with the iSCSI session) and got this in the log:

[ 1329.829689] isert: isert_put_response: Posting SCSI Response
[ 1329.829690] isert: isert_create_send_desc: tx_desc ffff88258a355680
lkey mismatch, fixing
[ 1329.829691] isert: isert_init_tx_hdrs: Setup tx_sg[0].addr:
0x258a355680 length: 76 lkey: 0x138c
[ 1329.829691] isert: isert_put_response: Posting SCSI Response
[ 1329.829692] isert: isert_recv_done: DMA: 0x3ee953e000, iSCSI
opcode: 0x01, ITT: 0x00000127, flags: 0xa1 dlen: 4096
[ 1329.829693] isert: isert_create_send_desc: tx_desc ffff88258a3635a8
lkey mismatch, fixing
[ 1329.829693] isert: isert_init_tx_hdrs: Setup tx_sg[0].addr:
0x258a3635a8 length: 76 lkey: 0x138c
[ 1329.829694] isert: isert_recv_done: ISER ISCSI_CTRL PDU
[ 1329.829695] isert: isert_put_response: Posting SCSI Response
[ 1329.829697] isert: isert_handle_scsi_cmd: Transfer Immediate
imm_data_len: 4096
[ 1329.829700] isert: isert_recv_done: DMA: 0x3ee941e000, iSCSI
opcode: 0x01, ITT: 0x000000cd, flags: 0xa1 dlen: 4096
[ 1329.829701] isert: isert_recv_done: ISER ISCSI_CTRL PDU
[ 1329.829702] isert: isert_handle_scsi_cmd: Transfer Immediate
imm_data_len: 4096
[ 1329.829705] isert: isert_create_send_desc: tx_desc ffff88258a349580
lkey mismatch, fixing
[ 1329.829705] isert: isert_init_tx_hdrs: Setup tx_sg[0].addr:
0x258a349580 length: 76 lkey: 0x138c
[ 1329.829705] isert: isert_put_response: Posting SCSI Response
[ 1329.829706] isert: isert_create_send_desc: tx_desc ffff88258a352c48
lkey mismatch, fixing
[ 1329.829707] isert: isert_init_tx_hdrs: Setup tx_sg[0].addr:
0x258a352c48 length: 76 lkey: 0x138c
[ 1329.829707] isert: isert_put_response: Posting SCSI Response
[ 1329.829709] isert: isert_recv_done: DMA: 0x3ee9424000, iSCSI
opcode: 0x01, ITT: 0x00000156, flags: 0xa1 dlen: 4096
[ 1329.829710] isert: isert_recv_done: ISER ISCSI_CTRL PDU
[ 1329.829712] isert: isert_handle_scsi_cmd: Transfer Immediate
imm_data_len: 4096
[ 1329.829715] isert: isert_recv_done: DMA: 0x3ee948d000, iSCSI
opcode: 0x01, ITT: 0x000001ea, flags: 0xa1 dlen: 4096
[ 1329.829715] isert: isert_recv_done: ISER ISCSI_CTRL PDU
[ 1329.829716] isert: isert_create_send_desc: tx_desc ffff88258a347150
lkey mismatch, fixing
[ 1329.829717] isert: isert_init_tx_hdrs: Setup tx_sg[0].addr:
0x258a347150 length: 76 lkey: 0x138c
[ 1329.829717] isert: isert_put_response: Posting SCSI Response
[ 1332.815592] isert: isert_wait_conn: Starting conn ffff887f1a860000
[ 1332.822012] isert: isert_conn_terminate: Terminating conn
ffff887f1a860000 state 3
[ 1332.831289] isert: isert_cma_handler: disconnected (10): status 0
id ffff887ebc6b8c00 np ffff887f23d74a80
[ 1334.874979] isert: isert_cma_handler: connect request (4): status 0
id ffff887f24cb5c00 np ffff887f23d74a80
[ 1334.885001] isert: isert_connect_request: cma_id: ffff887f24cb5c00,
portal: ffff887f23d74a80
[ 1334.885015] isert: isert_device_get: Found iser device
ffff887f1da5cc00 refcount 2
[ 1334.892792] isert: isert_set_nego_params: Using initiator_depth: 16
[ 1334.892792] isert: isert_set_nego_params: Using remote invalidation
[ 1334.899263] isert: isert_comp_get: conn ffff887f1a862000, using
comp ffff887ebc6b9018 min_index: 1
[ 1334.908871] isert: isert_login_post_recv: Setup sge: addr:
7eb0ef4000 length: 8268 0x0000138c
[ 1334.909235] isert: isert_cma_handler: timewait exit (15): status 0
id ffff887ebc6b8c00 np ffff887f23d74a80
[ 1334.919160] isert: isert_cma_handler: established (9): status 0 id
ffff887f24cb5c00 np ffff887f23d74a80
[ 1334.928795] isert: isert_connected_handler: conn ffff887f1a862000
[ 1334.935127] isert: isert_connected_handler: np ffff887f23d74a80:
Allow accept_np to continue
[ 1334.943864] isert: isert_accept_np: Processing isert_conn:
ffff887f1a862000
[ 1334.943875] isert: isert_get_login_rx: before login_req comp conn:
ffff887f1a862000
[ 1335.143891] isert: isert_rx_login_req: conn ffff887f1a862000
[ 1335.149817] isert: isert_rx_login_req: Using login payload size:
504, rx_buflen: 504 MAX_KEY_VALUE_PAIRS: 8192
[ 1335.149818] isert: isert_get_login_rx: before login_comp conn:
ffff887f1bee7000
[ 1335.157405] isert: isert_get_login_rx: processing login->req:
ffff883f5b66afe4
[ 1335.165010] isert: isert_get_sup_prot_ops: conn ffff887f1a862000 PI
offload disabled
[ 1349.909304] isert: isert_cma_handler: disconnected (10): status 0
id ffff887f24cb5c00 np ffff887f23d74a80
[ 1350.334806] iSCSI Login timeout on Network Portal 0.0.0.0:3260
[ 1352.162863] isert: isert_cma_handler: connect request (4): status 0
id ffff887f24cb3800 np ffff887f23d74a80
[ 1352.173132] isert: isert_connect_request: cma_id: ffff887f24cb3800,
portal: ffff887f23d74a80
[ 1352.173145] isert: isert_device_get: Found iser device
ffff887f1da5cc00 refcount 3
[ 1352.181138] isert: isert_set_nego_params: Using initiator_depth: 16
[ 1352.181139] isert: isert_set_nego_params: Using remote invalidation
[ 1352.187825] isert: isert_comp_get: conn ffff887f1a864000, using
comp ffff887ebc6b9030 min_index: 2
[ 1352.197560] isert: isert_login_post_recv: Setup sge: addr:
7eb7048000 length: 8268 0x0000138c
[ 1352.198721] isert: isert_cma_handler: established (9): status 0 id
ffff887f24cb3800 np ffff887f23d74a80
[ 1352.208547] isert: isert_connected_handler: conn ffff887f1a864000
[ 1352.215074] isert: isert_connected_handler: np ffff887f23d74a80:
Allow accept_np to continue
[ 1367.239002] isert: isert_cma_handler: disconnected (10): status 0
id ffff887f24cb3800 np ffff887f23d74a80
[ 1367.249249] isert: isert_conn_terminate: Terminating conn
ffff887f1a864000 state 1
[ 1367.258302] isert: isert_release_work: Starting release conn ffff887f1a864000
[ 1367.265965] isert: isert_release_work: Destroying conn ffff887f1a864000
[ 1367.273088] isert: isert_release_kref: conn ffff887f1a864000 final
kref kworker/u66:2/1631
[ 1367.281870] isert: isert_connect_release: conn ffff887f1a864000
[ 1367.282679] isert: isert_device_put: device ffff887f1da5cc00 refcount 2
[ 1369.492633] isert: isert_cma_handler: connect request (4): status 0
id ffff887f24cb0400 np ffff887f23d74a80
[ 1369.503049] isert: isert_connect_request: cma_id: ffff887f24cb0400,
portal: ffff887f23d74a80
[ 1369.503061] isert: isert_device_get: Found iser device
ffff887f1da5cc00 refcount 3
[ 1369.511199] isert: isert_set_nego_params: Using initiator_depth: 16
[ 1369.511199] isert: isert_set_nego_params: Using remote invalidation
[ 1369.518010] isert: isert_comp_get: conn ffff887f1a866000, using
comp ffff887ebc6b9030 min_index: 2
[ 1369.527949] isert: isert_login_post_recv: Setup sge: addr:
7f22bc0000 length: 8268 0x0000138c
[ 1369.529094] isert: isert_cma_handler: established (9): status 0 id
ffff887f24cb0400 np ffff887f23d74a80
[ 1369.539056] isert: isert_connected_handler: conn ffff887f1a866000
[ 1369.545747] isert: isert_connected_handler: np ffff887f23d74a80:
Allow accept_np to continue
[ 1384.567681] isert: isert_cma_handler: disconnected (10): status 0
id ffff887f24cb0400 np ffff887f23d74a80
[ 1384.578034] isert: isert_conn_terminate: Terminating conn
ffff887f1a866000 state 1
[ 1384.587321] isert: isert_release_work: Starting release conn ffff887f1a866000
[ 1384.595093] isert: isert_release_work: Destroying conn ffff887f1a866000
[ 1384.602328] isert: isert_release_kref: conn ffff887f1a866000 final
kref kworker/u66:0/16494
[ 1384.611337] isert: isert_connect_release: conn ffff887f1a866000
[ 1384.612370] isert: isert_device_put: device ffff887f1da5cc00 refcount 2
[ 1386.821154] isert: isert_cma_handler: connect request (4): status 0
id ffff887f24cb6c00 np ffff887f23d74a80
[ 1386.831788] isert: isert_connect_request: cma_id: ffff887f24cb6c00,
portal: ffff887f23d74a80
[ 1386.831794] isert: isert_device_get: Found iser device
ffff887f1da5cc00 refcount 3
[ 1386.840127] isert: isert_set_nego_params: Using initiator_depth: 16
[ 1386.840127] isert: isert_set_nego_params: Using remote invalidation
[ 1386.847144] isert: isert_comp_get: conn ffff887f1a864000, using
comp ffff887ebc6b9030 min_index: 2
[ 1386.857346] isert: isert_login_post_recv: Setup sge: addr:
7f22bc0000 length: 8268 0x0000138c
[ 1386.858217] isert: isert_cma_handler: established (9): status 0 id
ffff887f24cb6c00 np ffff887f23d74a80
[ 1386.868303] isert: isert_connected_handler: conn ffff887f1a864000
[ 1386.875085] isert: isert_connected_handler: np ffff887f23d74a80:
Allow accept_np to continue

>> Also, can you please summarize what kernel versions do you see this
>> with? The previous thread is a bit hard to follow at once.
>
> 4.9, 4.4.x, 4.1.x (we believe we saw it here, but it has been a long
> time since we have run this version.)
>
> Thanks for the response Sagi, I'm happy to have new things to try, I
> was really lost at where to go next.
>
> ----------------
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Robert LeBlanc Jan. 24, 2017, 9:59 p.m. UTC | #13
On Tue, Jan 24, 2017 at 3:37 AM, Max Gurtovoy <maxg@mellanox.com> wrote:
> Robert,
> can you please try replacing the ib_drain_qp with ib_drain_rq ? or make sure
> ib_drain_qp don't stuck.
> in the past (before v4.6) we sent "recv_beacon" on the isert disconnect
> flow.
> I suspect that area.
>
> Max.

It is getting stuck in target_wait_for_sess_cmds. ib_drain_qp is
returning just fine. I put in a bunch of debug statements and here is
a clip of them from a previous test.

[Tue Jan  3 16:55:24 2017] Starting iscsit_close_connection.
[Tue Jan  3 16:55:24 2017] Calling flush_workqueue.
[Tue Jan  3 16:55:24 2017] Returned from flush_workqueue.
[Tue Jan  3 16:55:25 2017] isert_wait_conn calling ib_close_qp/ib_drain_qp.
[Tue Jan  3 16:55:25 2017] isert_wait_conn finished ib_close_qp/ib_drain_qp.
[Tue Jan  3 16:55:25 2017] isert_wait_conn calling isert_put_unsol_pending_cmds.
[Tue Jan  3 16:55:25 2017] isert_wait_conn returned from
isert_put_unsol_pending_cmds.
[Tue Jan  3 16:55:25 2017] isert_wait_conn calling isert_wait4cmds.
[Tue Jan  3 16:55:25 2017] isert_wait4cmds calling
target_sess_cmd_list_set_waiting.
[Tue Jan  3 16:55:25 2017] isert_wait4cmds returned from
target_sess_cmd_list_set_waiting.
[Tue Jan  3 16:55:25 2017] isert_wait4cmds calling target_wait_for_sess_cmds.
[Tue Jan  3 16:55:25 2017] Waiting for se_cmd: ffff887ebf88bd00
t_state: 6, fabric state: 29
[Tue Jan  3 16:55:25 2017] target_wait_for_sess_cmds calling
spin_unlock_irqrestore.
[Tue Jan  3 16:55:25 2017] target_wait_for_sess_cmds calling
wait_for_completion.
[Tue Jan  3 16:55:26 2017] Calling wait_for_common.
[Tue Jan  3 16:55:27 2017] Starting __wait_for_common.
[Tue Jan  3 16:55:27 2017] Calling do_wait_for_common.
[Tue Jan  3 16:55:27 2017] Starting do_wait_for_common with x->done = 0.
[Tue Jan  3 16:55:27 2017] x->done = 0.
[Tue Jan  3 16:55:27 2017] Calling __set_current_state.
[Tue Jan  3 16:55:27 2017] Calling spin_unlock_irq.
[Tue Jan  3 16:55:27 2017] Calling action(7fffffffffffffff).
[Tue Jan  3 16:55:27 2017] Starting schedule_timeout with timeout
7fffffffffffffff (7fffffffffffffff).
[Tue Jan  3 16:55:27 2017] Starting schedule() in switch.
[Tue Jan  3 16:55:27 2017] Finished schedule() in switch.
[Tue Jan  3 16:55:27 2017] Calling spin_lock_irq.
[Tue Jan  3 16:55:27 2017] Calling __remove_wait_queue.
[Tue Jan  3 16:55:27 2017] Returned from __remove_wait_queue.
[Tue Jan  3 16:55:27 2017] Finished do_wait_for_common with x->done = 0.
[Tue Jan  3 16:55:27 2017] Returned from do_wait_for_common.
[Tue Jan  3 16:55:27 2017] Finished __wait_for_common.
[Tue Jan  3 16:55:27 2017] Returned from wait_for_common.
[Tue Jan  3 16:55:27 2017] Calling wait_for_common.
[Tue Jan  3 16:55:27 2017] Starting __wait_for_common.
[Tue Jan  3 16:55:27 2017] Calling do_wait_for_common.
[Tue Jan  3 16:55:27 2017] Starting do_wait_for_common with x->done = 0.
[Tue Jan  3 16:55:27 2017] x->done = 0.
[Tue Jan  3 16:55:27 2017] Calling __set_current_state.
[Tue Jan  3 16:55:27 2017] Calling spin_unlock_irq.
[Tue Jan  3 16:55:27 2017] Calling action(7fffffffffffffff).
[Tue Jan  3 16:55:27 2017] Starting schedule_timeout with timeout
7fffffffffffffff (7fffffffffffffff).
[Tue Jan  3 16:55:27 2017] Starting schedule() in switch.
[Tue Jan  3 16:55:27 2017] Finished schedule() in switch.
[Tue Jan  3 16:55:27 2017] Calling spin_lock_irq.
[Tue Jan  3 16:55:27 2017] Calling __remove_wait_queue.

If you think changing ib_drain_qp to ib_drain_rq might still help, I can do it.

Thanks,
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Bart Van Assche Jan. 24, 2017, 10:26 p.m. UTC | #14
On Tue, 2017-01-24 at 14:59 -0700, Robert LeBlanc wrote:
> It is getting stuck in target_wait_for_sess_cmds. ib_drain_qp is
> returning just fine. I put in a bunch of debug statements and here is
> a clip of them from a previous test.

The same code works fine with iSCSI/TCP, ib_srpt and qla2xxx. I would
appreciate it if an iSER could have a closer look at this.

Bart.--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Sagi Grimberg Jan. 26, 2017, 10:52 a.m. UTC | #15
>> Robert,
>> can you please try replacing the ib_drain_qp with ib_drain_rq ? or make sure
>> ib_drain_qp don't stuck.
>> in the past (before v4.6) we sent "recv_beacon" on the isert disconnect
>> flow.
>> I suspect that area.
>>
>> Max.
>
> It is getting stuck in target_wait_for_sess_cmds. ib_drain_qp is
> returning just fine. I put in a bunch of debug statements and here is
> a clip of them from a previous test.

It won't make a difference, and we need both the rq and the sq drained
before we continue with session teardown.

Robert,

I'm still missing the information of which commands are pending without
a completion.

Please activate target_wait_for_sess_cmds() debug logging.

You can turn on the whole file with:
echo "file target_core_transport.c +p" > 
/sys/kernel/debug/dynamic_debug/control

Cheers,
Sagi.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Robert LeBlanc Jan. 26, 2017, 6:05 p.m. UTC | #16
On Thu, Jan 26, 2017 at 3:52 AM, Sagi Grimberg <sagi@grimberg.me> wrote:
>> It is getting stuck in target_wait_for_sess_cmds. ib_drain_qp is
>> returning just fine. I put in a bunch of debug statements and here is
>> a clip of them from a previous test.
>
>
> It won't make a difference, and we need both the rq and the sq drained
> before we continue with session teardown.
>
> Robert,
>
> I'm still missing the information of which commands are pending without
> a completion.
>
> Please activate target_wait_for_sess_cmds() debug logging.
>
> You can turn on the whole file with:
> echo "file target_core_transport.c +p" >
> /sys/kernel/debug/dynamic_debug/control

OK, I'm using commit bb3d5f95f46dfe19554820dbce620d31504dafb2 from the
bvanassche/scsi-target-for-v4.11 tree. I've exported the target with
InitialR2T=No then I ran
echo 4 > /sys/module/ib_isert/parameters/debug_level
echo "file target_core_transport.c +p" > /sys/kernel/debug/dynamic_debug/control

but it seems there is no new information during the period of the hang.

[  806.924011] isert: isert_recv_done: DMA: 0x7eff323000, iSCSI
opcode: 0x01, ITT: 0x000001a3, flags: 0xa1 dlen: 4096
[  806.924011] isert: isert_recv_done: ISER ISCSI_CTRL PDU
[  806.924014] isert: isert_create_send_desc: tx_desc ffff887eff0e9c30
lkey mismatch, fixing
[  806.924015] isert: isert_init_tx_hdrs: Setup tx_sg[0].addr:
0x7eff0e9c30 length: 76 lkey: 0x425a
[  806.924016] isert: isert_put_response: Posting SCSI Response
[  806.924017] isert: isert_create_send_desc: tx_desc ffff887eff0ed278
lkey mismatch, fixing
[  806.924017] isert: isert_init_tx_hdrs: Setup tx_sg[0].addr:
0x7eff0ed278 length: 76 lkey: 0x425a
[  806.924018] isert: isert_put_response: Posting SCSI Response
[  806.924019] isert: isert_create_send_desc: tx_desc ffff887eff0f14d0
lkey mismatch, fixing
[  806.924019] isert: isert_init_tx_hdrs: Setup tx_sg[0].addr:
0x7eff0f14d0 length: 76 lkey: 0x425a
[  806.924020] isert: isert_put_response: Posting SCSI Response
[  806.924021] isert: isert_create_send_desc: tx_desc ffff887eff0f26e8
lkey mismatch, fixing
[  806.924022] isert: isert_init_tx_hdrs: Setup tx_sg[0].addr:
0x7eff0f26e8 length: 76 lkey: 0x425a
[  806.924022] isert: isert_put_response: Posting SCSI Response
[  817.066975] isert: isert_cma_handler: disconnected (10): status 0
id ffff887efe7be400 np ffff887f1c648d80
[  817.077425] isert: isert_wait_conn: Starting conn ffff887f25f84000
[  817.084212] isert: isert_conn_terminate: Terminating conn
ffff887f25f84000 state 3
[  819.202986] isert: isert_cma_handler: timewait exit (15): status 0
id ffff887efe7be400 np ffff887f1c648d80
[  819.607370] isert: isert_cma_handler: connect request (4): status 0
id ffff887efec36400 np ffff887f1c648d80
[  819.617859] isert: isert_connect_request: cma_id: ffff887efec36400,
portal: ffff887f1c648d80
[  819.617873] isert: isert_device_get: Found iser device
ffff887f20df5f40 refcount 2
[  819.626055] isert: isert_set_nego_params: Using initiator_depth: 16
[  819.626056] isert: isert_set_nego_params: Using remote invalidation
[  819.632925] isert: isert_comp_get: conn ffff887f25f86000, using
comp ffff887efe7be818 min_index: 1
[  819.643201] isert: isert_login_post_recv: Setup sge: addr:
7ebdb50000 length: 8268 0x0000425a
[  819.644010] isert: isert_cma_handler: established (9): status 0 id
ffff887efec36400 np ffff887f1c648d80
[  819.653970] isert: isert_connected_handler: conn ffff887f25f86000
[  819.660627] isert: isert_connected_handler: np ffff887f1c648d80:
Allow accept_np to continue
[  819.669680] isert: isert_accept_np: Processing isert_conn:
ffff887f25f86000
[  819.669688] isert: isert_get_login_rx: before login_req comp conn:
ffff887f25f86000
[  819.872665] isert: isert_rx_login_req: conn ffff887f25f86000
[  819.879052] isert: isert_rx_login_req: Using login payload size:
504, rx_buflen: 504 MAX_KEY_VALUE_PAIRS: 8192
[  819.879053] isert: isert_get_login_rx: before login_comp conn:
ffff887f1a457800
[  819.886973] isert: isert_get_login_rx: processing login->req:
ffff887effcb13a4
[  819.894936] isert: isert_get_sup_prot_ops: conn ffff887f25f86000 PI
offload disabled
[  834.639402] isert: isert_cma_handler: disconnected (10): status 0
id ffff887efec36400 np ffff887f1c648d80
[  835.203732] iSCSI Login timeout on Network Portal 0.0.0.0:3260
[  836.893231] isert: isert_cma_handler: connect request (4): status 0
id ffff887efec31800 np ffff887f1c648d80
[  836.903737] isert: isert_connect_request: cma_id: ffff887efec31800,
portal: ffff887f1c648d80
[  836.903752] isert: isert_device_get: Found iser device
ffff887f20df5f40 refcount 3
[  836.911938] isert: isert_set_nego_params: Using initiator_depth: 16
[  836.911951] isert: isert_set_nego_params: Using remote invalidation
[  836.918819] isert: isert_comp_get: conn ffff887f25f82000, using
comp ffff887efe7be830 min_index: 2
[  836.928920] isert: isert_login_post_recv: Setup sge: addr:
7f11f34000 length: 8268 0x0000425a
[  836.929796] isert: isert_cma_handler: established (9): status 0 id
ffff887efec31800 np ffff887f1c648d80
[  836.939815] isert: isert_connected_handler: conn ffff887f25f82000
[  836.946522] isert: isert_connected_handler: np ffff887f1c648d80:
Allow accept_np to continue
[  851.970046] isert: isert_cma_handler: disconnected (10): status 0
id ffff887efec31800 np ffff887f1c648d80
[  851.980368] isert: isert_conn_terminate: Terminating conn
ffff887f25f82000 state 1
[  851.989703] isert: isert_release_work: Starting release conn
ffff887f25f82000
[  851.997449] isert: isert_release_work: Destroying conn
ffff887f25f82000
[  852.004665] isert: isert_release_kref: conn ffff887f25f82000 final
kref kworker/u66:0/80
[  852.013351] isert: isert_connect_release: conn ffff887f25f82000
[  852.014178] isert: isert_device_put: device ffff887f20df5f40
refcount 2
[  854.223515] isert: isert_cma_handler: connect request (4): status 0
id ffff887efec33800 np ffff887f1c648d80
[  854.234033] isert: isert_connect_request: cma_id: ffff887efec33800,
portal: ffff887f1c648d80
[  854.234039] isert: isert_device_get: Found iser device
ffff887f20df5f40 refcount 3
[  854.242271] isert: isert_set_nego_params: Using initiator_depth: 16
[  854.242272] isert: isert_set_nego_params: Using remote invalidation
[  854.249204] isert: isert_comp_get: conn ffff887f25f82000, using
comp ffff887efe7be830 min_index: 2
[  854.259359] isert: isert_login_post_recv: Setup sge: addr:
7f11f34000 length: 8268 0x0000425a
[  854.260222] isert: isert_cma_handler: established (9): status 0 id
ffff887efec33800 np ffff887f1c648d80
[  854.270220] isert: isert_connected_handler: conn ffff887f25f82000
[  854.276901] isert: isert_connected_handler: np ffff887f1c648d80:
Allow accept_np to continue
[  869.295790] isert: isert_cma_handler: disconnected (10): status 0
id ffff887efec33800 np ffff887f1c648d80

I don't think journal is able to log all the messages before the hang
if any of those are important,

Jan 26 10:31:08 prv-0-13-roberttest kernel: isert: isert_get_login_rx:
before login_req comp conn: ffff887f25f84000
Jan 26 10:31:08 prv-0-13-roberttest kernel: isert: isert_rx_login_req:
conn ffff887f25f84000
Jan 26 10:31:08 prv-0-13-roberttest kernel: isert: isert_get_login_rx:
before login_comp conn: ffff887f23c95800
Jan 26 10:31:08 prv-0-13-roberttest kernel: isert: isert_get_login_rx:
processing login->req: ffff883f6614f8e4
Jan 26 10:31:08 prv-0-13-roberttest kernel: isert:
isert_get_sup_prot_ops: conn ffff887f25f84000 PI offload disabled
Jan 26 10:31:08 prv-0-13-roberttest kernel:
iSCSI/iqn.2004-12.com.betterservers:10.91.0.14: Unsupported SCSI
Opcode 0xa3, sending CHECK_CONDITION.
Jan 26 10:31:30 prv-0-13-roberttest journal: Missed 639 kernel messages
Jan 26 10:31:30 prv-0-13-roberttest journal: Missed 101 kernel messages
Jan 26 10:31:30 prv-0-13-roberttest journal: Missed 62 kernel messages
Jan 26 10:31:30 prv-0-13-roberttest journal: Missed 60 kernel messages
Jan 26 10:31:30 prv-0-13-roberttest journal: Missed 62 kernel messages
Jan 26 10:31:30 prv-0-13-roberttest journal: Missed 52 kernel messages
Jan 26 10:31:30 prv-0-13-roberttest journal: Missed 68 kernel messages
Jan 26 10:31:30 prv-0-13-roberttest journal: Missed 56 kernel messages
Jan 26 10:31:30 prv-0-13-roberttest journal: Missed 57 kernel messages
Jan 26 10:31:30 prv-0-13-roberttest journal: Missed 49 kernel messages
Jan 26 10:31:30 prv-0-13-roberttest journal: Missed 63 kernel messages
...
Jan 26 10:31:31 prv-0-13-roberttest journal: Missed 99 kernel messages
Jan 26 10:31:31 prv-0-13-roberttest journal: Missed 123 kernel messages
Jan 26 10:31:31 prv-0-13-roberttest journal: Missed 216 kernel messages
Jan 26 10:31:31 prv-0-13-roberttest journal: Missed 64 kernel messages
Jan 26 10:31:31 prv-0-13-roberttest journal: Missed 7 kernel messages
Jan 26 10:31:34 prv-0-13-roberttest kernel: WRITE_SAME sectors: 4088
exceeds max_write_same_len: 0
Jan 26 10:31:34 prv-0-13-roberttest rsyslogd-2177: imjournal: begin to
drop messages due to rate-limiting
Jan 26 10:40:03 prv-0-13-roberttest rsyslogd-2177: imjournal: 1210115
messages lost due to rate-limiting
Jan 26 10:40:03 prv-0-13-roberttest kernel: isert: isert_cma_handler:
connect request (4): status 0 id ffff887efec32c00 np ffff887f1c648d80
Jan 26 10:40:03 prv-0-13-roberttest kernel: isert: isert_device_get:
Found iser device ffff887f20df5f40 refcount 3
Jan 26 10:40:03 prv-0-13-roberttest kernel: isert:
isert_set_nego_params: Using remote invalidation
Jan 26 10:40:03 prv-0-13-roberttest kernel: isert: isert_comp_get:
conn ffff887f25f82000, using comp ffff887efe7be830 min_index: 2
Jan 26 10:40:03 prv-0-13-roberttest kernel: isert: isert_cma_handler:
established (9): status 0 id ffff887efec32c00 np ffff887f1c648d80
Jan 26 10:40:03 prv-0-13-roberttest kernel: isert:
isert_connected_handler: conn ffff887f25f82000
Jan 26 10:40:03 prv-0-13-roberttest kernel: isert:
isert_connected_handler: np ffff887f1c648d80: Allow accept_np to
continue
Jan 26 10:40:03 prv-0-13-roberttest systemd: Removed slice user-0.slice.
Jan 26 10:40:03 prv-0-13-roberttest systemd: Stopping user-0.slice.
Jan 26 10:40:18 prv-0-13-roberttest kernel: isert: isert_cma_handler:
disconnected (10): status 0 id ffff887efec32c00 np ffff887f1c648d80
Jan 26 10:40:18 prv-0-13-roberttest kernel: isert:
isert_conn_terminate: Terminating conn ffff887f25f82000 state 1
Jan 26 10:40:18 prv-0-13-roberttest kernel: isert: isert_release_work:
Starting release conn ffff887f25f82000
Jan 26 10:40:18 prv-0-13-roberttest kernel: isert: isert_release_work:
Destroying conn ffff887f25f82000
Jan 26 10:40:18 prv-0-13-roberttest kernel: isert: isert_release_kref:
conn ffff887f25f82000 final kref kworker/u66:2/1743
Jan 26 10:40:18 prv-0-13-roberttest kernel: isert: isert_device_put:
device ffff887f20df5f40 refcount 2
Jan 26 10:40:21 prv-0-13-roberttest kernel: isert: isert_cma_handler:
connect request (4): status 0 id ffff887efec31c00 np ffff887f1c648d80
Jan 26 10:40:21 prv-0-13-roberttest kernel: isert: isert_device_get:
Found iser device ffff887f20df5f40 refcount 3
Jan 26 10:40:21 prv-0-13-roberttest kernel: isert:
isert_set_nego_params: Using remote invalidation
Jan 26 10:40:21 prv-0-13-roberttest kernel: isert: isert_comp_get:
conn ffff887f25f82000, using comp ffff887efe7be830 min_index: 2
Jan 26 10:40:21 prv-0-13-roberttest kernel: isert: isert_cma_handler:
established (9): status 0 id ffff887efec31c00 np ffff887f1c648d80
Jan 26 10:40:21 prv-0-13-roberttest kernel: isert:
isert_connected_handler: conn ffff887f25f82000
Jan 26 10:40:21 prv-0-13-roberttest kernel: isert:
isert_connected_handler: np ffff887f1c648d80: Allow accept_np to
continue
Jan 26 10:40:36 prv-0-13-roberttest kernel: isert: isert_cma_handler:
disconnected (10): status 0 id ffff887efec31c00 np ffff887f1c648d80
Jan 26 10:40:36 prv-0-13-roberttest kernel: isert:
isert_conn_terminate: Terminating conn ffff887f25f82000 state 1
Jan 26 10:40:36 prv-0-13-roberttest kernel: isert: isert_release_work:
Starting release conn ffff887f25f82000
Jan 26 10:40:36 prv-0-13-roberttest kernel: isert: isert_release_work:
Destroying conn ffff887f25f82000

I thought I saw some new messages in dmesg in regards to the new debug
command when I logged into the target, but I can't find them in
/var/log/messages now. I assume that I would find the messages in
dmesg, please let me know if I should be looking elsewhere.

Thanks for your help and patience.

----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Robert LeBlanc Jan. 27, 2017, 11:18 p.m. UTC | #17
On Thu, Jan 26, 2017 at 11:05 AM, Robert LeBlanc <robert@leblancnet.us> wrote:
> On Thu, Jan 26, 2017 at 3:52 AM, Sagi Grimberg <sagi@grimberg.me> wrote:
>>> It is getting stuck in target_wait_for_sess_cmds. ib_drain_qp is
>>> returning just fine. I put in a bunch of debug statements and here is
>>> a clip of them from a previous test.
>>
>>
>> It won't make a difference, and we need both the rq and the sq drained
>> before we continue with session teardown.
>>
>> Robert,
>>
>> I'm still missing the information of which commands are pending without
>> a completion.
>>
>> Please activate target_wait_for_sess_cmds() debug logging.
>>
>> You can turn on the whole file with:
>> echo "file target_core_transport.c +p" >
>> /sys/kernel/debug/dynamic_debug/control
>
> OK, I'm using commit bb3d5f95f46dfe19554820dbce620d31504dafb2 from the
> bvanassche/scsi-target-for-v4.11 tree. I've exported the target with
> InitialR2T=No then I ran
> echo 4 > /sys/module/ib_isert/parameters/debug_level
> echo "file target_core_transport.c +p" > /sys/kernel/debug/dynamic_debug/control
>
> but it seems there is no new information during the period of the hang.
>
> [  806.924011] isert: isert_recv_done: DMA: 0x7eff323000, iSCSI
> opcode: 0x01, ITT: 0x000001a3, flags: 0xa1 dlen: 4096
> [  806.924011] isert: isert_recv_done: ISER ISCSI_CTRL PDU
> [  806.924014] isert: isert_create_send_desc: tx_desc ffff887eff0e9c30
> lkey mismatch, fixing
> [  806.924015] isert: isert_init_tx_hdrs: Setup tx_sg[0].addr:
> 0x7eff0e9c30 length: 76 lkey: 0x425a
> [  806.924016] isert: isert_put_response: Posting SCSI Response
> [  806.924017] isert: isert_create_send_desc: tx_desc ffff887eff0ed278
> lkey mismatch, fixing
> [  806.924017] isert: isert_init_tx_hdrs: Setup tx_sg[0].addr:
> 0x7eff0ed278 length: 76 lkey: 0x425a
> [  806.924018] isert: isert_put_response: Posting SCSI Response
> [  806.924019] isert: isert_create_send_desc: tx_desc ffff887eff0f14d0
> lkey mismatch, fixing
> [  806.924019] isert: isert_init_tx_hdrs: Setup tx_sg[0].addr:
> 0x7eff0f14d0 length: 76 lkey: 0x425a
> [  806.924020] isert: isert_put_response: Posting SCSI Response
> [  806.924021] isert: isert_create_send_desc: tx_desc ffff887eff0f26e8
> lkey mismatch, fixing
> [  806.924022] isert: isert_init_tx_hdrs: Setup tx_sg[0].addr:
> 0x7eff0f26e8 length: 76 lkey: 0x425a
> [  806.924022] isert: isert_put_response: Posting SCSI Response
> [  817.066975] isert: isert_cma_handler: disconnected (10): status 0
> id ffff887efe7be400 np ffff887f1c648d80
> [  817.077425] isert: isert_wait_conn: Starting conn ffff887f25f84000
> [  817.084212] isert: isert_conn_terminate: Terminating conn
> ffff887f25f84000 state 3
> [  819.202986] isert: isert_cma_handler: timewait exit (15): status 0
> id ffff887efe7be400 np ffff887f1c648d80
> [  819.607370] isert: isert_cma_handler: connect request (4): status 0
> id ffff887efec36400 np ffff887f1c648d80
> [  819.617859] isert: isert_connect_request: cma_id: ffff887efec36400,
> portal: ffff887f1c648d80
> [  819.617873] isert: isert_device_get: Found iser device
> ffff887f20df5f40 refcount 2
> [  819.626055] isert: isert_set_nego_params: Using initiator_depth: 16
> [  819.626056] isert: isert_set_nego_params: Using remote invalidation
> [  819.632925] isert: isert_comp_get: conn ffff887f25f86000, using
> comp ffff887efe7be818 min_index: 1
> [  819.643201] isert: isert_login_post_recv: Setup sge: addr:
> 7ebdb50000 length: 8268 0x0000425a
> [  819.644010] isert: isert_cma_handler: established (9): status 0 id
> ffff887efec36400 np ffff887f1c648d80
> [  819.653970] isert: isert_connected_handler: conn ffff887f25f86000
> [  819.660627] isert: isert_connected_handler: np ffff887f1c648d80:
> Allow accept_np to continue
> [  819.669680] isert: isert_accept_np: Processing isert_conn:
> ffff887f25f86000
> [  819.669688] isert: isert_get_login_rx: before login_req comp conn:
> ffff887f25f86000
> [  819.872665] isert: isert_rx_login_req: conn ffff887f25f86000
> [  819.879052] isert: isert_rx_login_req: Using login payload size:
> 504, rx_buflen: 504 MAX_KEY_VALUE_PAIRS: 8192
> [  819.879053] isert: isert_get_login_rx: before login_comp conn:
> ffff887f1a457800
> [  819.886973] isert: isert_get_login_rx: processing login->req:
> ffff887effcb13a4
> [  819.894936] isert: isert_get_sup_prot_ops: conn ffff887f25f86000 PI
> offload disabled
> [  834.639402] isert: isert_cma_handler: disconnected (10): status 0
> id ffff887efec36400 np ffff887f1c648d80
> [  835.203732] iSCSI Login timeout on Network Portal 0.0.0.0:3260
> [  836.893231] isert: isert_cma_handler: connect request (4): status 0
> id ffff887efec31800 np ffff887f1c648d80
> [  836.903737] isert: isert_connect_request: cma_id: ffff887efec31800,
> portal: ffff887f1c648d80
> [  836.903752] isert: isert_device_get: Found iser device
> ffff887f20df5f40 refcount 3
> [  836.911938] isert: isert_set_nego_params: Using initiator_depth: 16
> [  836.911951] isert: isert_set_nego_params: Using remote invalidation
> [  836.918819] isert: isert_comp_get: conn ffff887f25f82000, using
> comp ffff887efe7be830 min_index: 2
> [  836.928920] isert: isert_login_post_recv: Setup sge: addr:
> 7f11f34000 length: 8268 0x0000425a
> [  836.929796] isert: isert_cma_handler: established (9): status 0 id
> ffff887efec31800 np ffff887f1c648d80
> [  836.939815] isert: isert_connected_handler: conn ffff887f25f82000
> [  836.946522] isert: isert_connected_handler: np ffff887f1c648d80:
> Allow accept_np to continue
> [  851.970046] isert: isert_cma_handler: disconnected (10): status 0
> id ffff887efec31800 np ffff887f1c648d80
> [  851.980368] isert: isert_conn_terminate: Terminating conn
> ffff887f25f82000 state 1
> [  851.989703] isert: isert_release_work: Starting release conn
> ffff887f25f82000
> [  851.997449] isert: isert_release_work: Destroying conn
> ffff887f25f82000
> [  852.004665] isert: isert_release_kref: conn ffff887f25f82000 final
> kref kworker/u66:0/80
> [  852.013351] isert: isert_connect_release: conn ffff887f25f82000
> [  852.014178] isert: isert_device_put: device ffff887f20df5f40
> refcount 2
> [  854.223515] isert: isert_cma_handler: connect request (4): status 0
> id ffff887efec33800 np ffff887f1c648d80
> [  854.234033] isert: isert_connect_request: cma_id: ffff887efec33800,
> portal: ffff887f1c648d80
> [  854.234039] isert: isert_device_get: Found iser device
> ffff887f20df5f40 refcount 3
> [  854.242271] isert: isert_set_nego_params: Using initiator_depth: 16
> [  854.242272] isert: isert_set_nego_params: Using remote invalidation
> [  854.249204] isert: isert_comp_get: conn ffff887f25f82000, using
> comp ffff887efe7be830 min_index: 2
> [  854.259359] isert: isert_login_post_recv: Setup sge: addr:
> 7f11f34000 length: 8268 0x0000425a
> [  854.260222] isert: isert_cma_handler: established (9): status 0 id
> ffff887efec33800 np ffff887f1c648d80
> [  854.270220] isert: isert_connected_handler: conn ffff887f25f82000
> [  854.276901] isert: isert_connected_handler: np ffff887f1c648d80:
> Allow accept_np to continue
> [  869.295790] isert: isert_cma_handler: disconnected (10): status 0
> id ffff887efec33800 np ffff887f1c648d80
>
> I don't think journal is able to log all the messages before the hang
> if any of those are important,
>
> Jan 26 10:31:08 prv-0-13-roberttest kernel: isert: isert_get_login_rx:
> before login_req comp conn: ffff887f25f84000
> Jan 26 10:31:08 prv-0-13-roberttest kernel: isert: isert_rx_login_req:
> conn ffff887f25f84000
> Jan 26 10:31:08 prv-0-13-roberttest kernel: isert: isert_get_login_rx:
> before login_comp conn: ffff887f23c95800
> Jan 26 10:31:08 prv-0-13-roberttest kernel: isert: isert_get_login_rx:
> processing login->req: ffff883f6614f8e4
> Jan 26 10:31:08 prv-0-13-roberttest kernel: isert:
> isert_get_sup_prot_ops: conn ffff887f25f84000 PI offload disabled
> Jan 26 10:31:08 prv-0-13-roberttest kernel:
> iSCSI/iqn.2004-12.com.betterservers:10.91.0.14: Unsupported SCSI
> Opcode 0xa3, sending CHECK_CONDITION.
> Jan 26 10:31:30 prv-0-13-roberttest journal: Missed 639 kernel messages
> Jan 26 10:31:30 prv-0-13-roberttest journal: Missed 101 kernel messages
> Jan 26 10:31:30 prv-0-13-roberttest journal: Missed 62 kernel messages
> Jan 26 10:31:30 prv-0-13-roberttest journal: Missed 60 kernel messages
> Jan 26 10:31:30 prv-0-13-roberttest journal: Missed 62 kernel messages
> Jan 26 10:31:30 prv-0-13-roberttest journal: Missed 52 kernel messages
> Jan 26 10:31:30 prv-0-13-roberttest journal: Missed 68 kernel messages
> Jan 26 10:31:30 prv-0-13-roberttest journal: Missed 56 kernel messages
> Jan 26 10:31:30 prv-0-13-roberttest journal: Missed 57 kernel messages
> Jan 26 10:31:30 prv-0-13-roberttest journal: Missed 49 kernel messages
> Jan 26 10:31:30 prv-0-13-roberttest journal: Missed 63 kernel messages
> ...
> Jan 26 10:31:31 prv-0-13-roberttest journal: Missed 99 kernel messages
> Jan 26 10:31:31 prv-0-13-roberttest journal: Missed 123 kernel messages
> Jan 26 10:31:31 prv-0-13-roberttest journal: Missed 216 kernel messages
> Jan 26 10:31:31 prv-0-13-roberttest journal: Missed 64 kernel messages
> Jan 26 10:31:31 prv-0-13-roberttest journal: Missed 7 kernel messages
> Jan 26 10:31:34 prv-0-13-roberttest kernel: WRITE_SAME sectors: 4088
> exceeds max_write_same_len: 0
> Jan 26 10:31:34 prv-0-13-roberttest rsyslogd-2177: imjournal: begin to
> drop messages due to rate-limiting
> Jan 26 10:40:03 prv-0-13-roberttest rsyslogd-2177: imjournal: 1210115
> messages lost due to rate-limiting
> Jan 26 10:40:03 prv-0-13-roberttest kernel: isert: isert_cma_handler:
> connect request (4): status 0 id ffff887efec32c00 np ffff887f1c648d80
> Jan 26 10:40:03 prv-0-13-roberttest kernel: isert: isert_device_get:
> Found iser device ffff887f20df5f40 refcount 3
> Jan 26 10:40:03 prv-0-13-roberttest kernel: isert:
> isert_set_nego_params: Using remote invalidation
> Jan 26 10:40:03 prv-0-13-roberttest kernel: isert: isert_comp_get:
> conn ffff887f25f82000, using comp ffff887efe7be830 min_index: 2
> Jan 26 10:40:03 prv-0-13-roberttest kernel: isert: isert_cma_handler:
> established (9): status 0 id ffff887efec32c00 np ffff887f1c648d80
> Jan 26 10:40:03 prv-0-13-roberttest kernel: isert:
> isert_connected_handler: conn ffff887f25f82000
> Jan 26 10:40:03 prv-0-13-roberttest kernel: isert:
> isert_connected_handler: np ffff887f1c648d80: Allow accept_np to
> continue
> Jan 26 10:40:03 prv-0-13-roberttest systemd: Removed slice user-0.slice.
> Jan 26 10:40:03 prv-0-13-roberttest systemd: Stopping user-0.slice.
> Jan 26 10:40:18 prv-0-13-roberttest kernel: isert: isert_cma_handler:
> disconnected (10): status 0 id ffff887efec32c00 np ffff887f1c648d80
> Jan 26 10:40:18 prv-0-13-roberttest kernel: isert:
> isert_conn_terminate: Terminating conn ffff887f25f82000 state 1
> Jan 26 10:40:18 prv-0-13-roberttest kernel: isert: isert_release_work:
> Starting release conn ffff887f25f82000
> Jan 26 10:40:18 prv-0-13-roberttest kernel: isert: isert_release_work:
> Destroying conn ffff887f25f82000
> Jan 26 10:40:18 prv-0-13-roberttest kernel: isert: isert_release_kref:
> conn ffff887f25f82000 final kref kworker/u66:2/1743
> Jan 26 10:40:18 prv-0-13-roberttest kernel: isert: isert_device_put:
> device ffff887f20df5f40 refcount 2
> Jan 26 10:40:21 prv-0-13-roberttest kernel: isert: isert_cma_handler:
> connect request (4): status 0 id ffff887efec31c00 np ffff887f1c648d80
> Jan 26 10:40:21 prv-0-13-roberttest kernel: isert: isert_device_get:
> Found iser device ffff887f20df5f40 refcount 3
> Jan 26 10:40:21 prv-0-13-roberttest kernel: isert:
> isert_set_nego_params: Using remote invalidation
> Jan 26 10:40:21 prv-0-13-roberttest kernel: isert: isert_comp_get:
> conn ffff887f25f82000, using comp ffff887efe7be830 min_index: 2
> Jan 26 10:40:21 prv-0-13-roberttest kernel: isert: isert_cma_handler:
> established (9): status 0 id ffff887efec31c00 np ffff887f1c648d80
> Jan 26 10:40:21 prv-0-13-roberttest kernel: isert:
> isert_connected_handler: conn ffff887f25f82000
> Jan 26 10:40:21 prv-0-13-roberttest kernel: isert:
> isert_connected_handler: np ffff887f1c648d80: Allow accept_np to
> continue
> Jan 26 10:40:36 prv-0-13-roberttest kernel: isert: isert_cma_handler:
> disconnected (10): status 0 id ffff887efec31c00 np ffff887f1c648d80
> Jan 26 10:40:36 prv-0-13-roberttest kernel: isert:
> isert_conn_terminate: Terminating conn ffff887f25f82000 state 1
> Jan 26 10:40:36 prv-0-13-roberttest kernel: isert: isert_release_work:
> Starting release conn ffff887f25f82000
> Jan 26 10:40:36 prv-0-13-roberttest kernel: isert: isert_release_work:
> Destroying conn ffff887f25f82000
>
> I thought I saw some new messages in dmesg in regards to the new debug
> command when I logged into the target, but I can't find them in
> /var/log/messages now. I assume that I would find the messages in
> dmesg, please let me know if I should be looking elsewhere.
>
> Thanks for your help and patience.
>
> ----------------
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1

Sagi,

Just an FYI we hit the D state in our production Infiniband yesterday
pretty reliably. Since it didn't seem that I have been getting useful
info from my dev environment with the last couple of emails, I did't
try to get any debug from our production environment. We need to run
4.4.x in our production environment mostly for ZFS so when we get
something that is showing what the problem is, we will try to backport
the code to 4.4.x and try to get it to trigger on our Infiniband gear
and if we can get it into that state and we have a proposed fix, we
can verify it works there too. When we trigger the situation, we can
usually reboot the target and it will retrigger until we clean up the
initiator, usually shutting down all the VMs and logging out of the
iser sessions. So there may be some bug on the initiator side as well
that is holding onto some bad session data, but I think we should
continue to focus on the target bug as it is the most disruptive.

Thanks,
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Sagi Grimberg Jan. 29, 2017, 7:42 a.m. UTC | #18
> OK, I'm using commit bb3d5f95f46dfe19554820dbce620d31504dafb2 from the
> bvanassche/scsi-target-for-v4.11 tree. I've exported the target with
> InitialR2T=No

Hey Robert, InitialR2T=No means that you are using UnsolicitedDataOut, I
recommend to not use it.

  then I ran
> echo 4 > /sys/module/ib_isert/parameters/debug_level
> echo "file target_core_transport.c +p" > /sys/kernel/debug/dynamic_debug/control
>
> but it seems there is no new information during the period of the hang.

IIRC, pr_debug are not directed to the journal by default, can you
please check dmesg output?
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Robert LeBlanc Jan. 29, 2017, 7:13 p.m. UTC | #19
On Sun, Jan 29, 2017 at 12:42 AM, Sagi Grimberg <sagi@grimberg.me> wrote:
>> OK, I'm using commit bb3d5f95f46dfe19554820dbce620d31504dafb2 from the
>> bvanassche/scsi-target-for-v4.11 tree. I've exported the target with
>> InitialR2T=No
>
>
> Hey Robert, InitialR2T=No means that you are using UnsolicitedDataOut, I
> recommend to not use it.

This doesn't seem to make a difference in regards to hitting the bug,
but I'll set it back to on for all other tests.

>  then I ran
>>
>> echo 4 > /sys/module/ib_isert/parameters/debug_level
>> echo "file target_core_transport.c +p" >
>> /sys/kernel/debug/dynamic_debug/control
>>
>> but it seems there is no new information during the period of the hang.
>
>
> IIRC, pr_debug are not directed to the journal by default, can you
> please check dmesg output?

The first output (without the dates) is from dmesg, the second output
is from journal.

----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Sagi Grimberg Jan. 30, 2017, 7:39 a.m. UTC | #20
> The first output (without the dates) is from dmesg, the second output
> is from journal.

Hey Robert,

I'm still having a hard time understanding the traces.
How about you reproduce the error and attach a full dmesg
output? I suggest to reproduce once with debug_level=2 (to
cleanup some noise) and once with debug_level=4 and hopefully
we can understand what is getting lost.

Also, please wait for the hang timeout watchdog traces, they
can give us confirmation that we are stuck waiting for a final
reference on some command.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Robert LeBlanc Jan. 30, 2017, 7:36 p.m. UTC | #21
On Mon, Jan 30, 2017 at 12:39 AM, Sagi Grimberg <sagi@grimberg.me> wrote:
>> The first output (without the dates) is from dmesg, the second output
>> is from journal.
>
>
> Hey Robert,
>
> I'm still having a hard time understanding the traces.
> How about you reproduce the error and attach a full dmesg
> output? I suggest to reproduce once with debug_level=2 (to
> cleanup some noise) and once with debug_level=4 and hopefully
> we can understand what is getting lost.
>
> Also, please wait for the hang timeout watchdog traces, they
> can give us confirmation that we are stuck waiting for a final
> reference on some command.

I left the capture of dmesg run for ~20 minutes after iSER failed.
2017-01-30_11:37:28_dmesg.log is level 2 and
2017-01-30_12:04:47_dmesg.log is level 4. The log files can be found
at [1].

[1] http://mirrors.betterservers.com/trace/dmesg_logs.txz
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Robert LeBlanc Feb. 1, 2017, 5:56 p.m. UTC | #22
On Mon, Jan 30, 2017 at 12:36 PM, Robert LeBlanc <robert@leblancnet.us> wrote:
> On Mon, Jan 30, 2017 at 12:39 AM, Sagi Grimberg <sagi@grimberg.me> wrote:
>>> The first output (without the dates) is from dmesg, the second output
>>> is from journal.
>>
>>
>> Hey Robert,
>>
>> I'm still having a hard time understanding the traces.
>> How about you reproduce the error and attach a full dmesg
>> output? I suggest to reproduce once with debug_level=2 (to
>> cleanup some noise) and once with debug_level=4 and hopefully
>> we can understand what is getting lost.
>>
>> Also, please wait for the hang timeout watchdog traces, they
>> can give us confirmation that we are stuck waiting for a final
>> reference on some command.
>
> I left the capture of dmesg run for ~20 minutes after iSER failed.
> 2017-01-30_11:37:28_dmesg.log is level 2 and
> 2017-01-30_12:04:47_dmesg.log is level 4. The log files can be found
> at [1].
>
> [1] http://mirrors.betterservers.com/trace/dmesg_logs.txz


I saw a patch from Max [1] about a related issue so I gave it a try
and it didn't help. I'm linking to the logs in case you want to look
at them as well. [2]

[1] "iSER fails to release rdma resources (WRs) if iw_cxgb4 is
unloaded while IO is in progress
https://www.spinics.net/lists/linux-rdma/msg45799.html
[2] http://mirrors.betterservers.com/trace/2017_02_01_dmesg.txz
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
index 8368764..ed36748 100644
--- a/drivers/infiniband/core/verbs.c
+++ b/drivers/infiniband/core/verbs.c
@@ -2089,3 +2089,19 @@  void ib_drain_qp(struct ib_qp *qp)
               ib_drain_rq(qp);
}
EXPORT_SYMBOL(ib_drain_qp);
+
+void ib_reset_sq(struct ib_qp *qp)
+{
+       struct ib_qp_attr attr = { .qp_state = IB_QPS_RESET};
+       int ret;
+
+       ret = ib_modify_qp(qp, &attr, IB_QP_STATE);
+}
+EXPORT_SYMBOL(ib_reset_sq);
+
+void ib_reset_qp(struct ib_qp *qp)
+{
+       printk("ib_reset_qp calling ib_reset_sq.\n");
+       ib_reset_sq(qp);
+}
+EXPORT_SYMBOL(ib_reset_qp);
diff --git a/drivers/infiniband/ulp/isert/ib_isert.c
b/drivers/infiniband/ulp/isert/ib_isert.c
index 6dd43f6..619dbc7 100644
--- a/drivers/infiniband/ulp/isert/ib_isert.c
+++ b/drivers/infiniband/ulp/isert/ib_isert.c
@@ -2595,10 +2595,9 @@  static void isert_wait_conn(struct iscsi_conn *conn)
       isert_conn_terminate(isert_conn);
       mutex_unlock(&isert_conn->mutex);

-       ib_drain_qp(isert_conn->qp);
+       ib_reset_qp(isert_conn->qp);
       isert_put_unsol_pending_cmds(conn);
-       isert_wait4cmds(conn);
-       isert_wait4logout(isert_conn);
+       cancel_work_sync(&isert_conn->release_work);

       queue_work(isert_release_wq, &isert_conn->release_work);
}
@@ -2607,7 +2606,7 @@  static void isert_free_conn(struct iscsi_conn *conn)
{
       struct isert_conn *isert_conn = conn->context;

-       ib_drain_qp(isert_conn->qp);
+       ib_close_qp(isert_conn->qp);
       isert_put_conn(isert_conn);
}

diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 5ad43a4..3310c37 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -3357,4 +3357,6 @@  int ib_sg_to_pages(struct ib_mr *mr, struct
scatterlist *sgl, int sg_nents,
void ib_drain_rq(struct ib_qp *qp);
void ib_drain_sq(struct ib_qp *qp);
void ib_drain_qp(struct ib_qp *qp);
+void ib_reset_sq(struct ib_qp *qp);
+void ib_reset_qp(struct ib_qp *qp);