| Message ID | 1301607843.30852658.1487021644535.JavaMail.zimbra@redhat.com (mailing list archive) |
| --- | --- |
| State | Superseded |
----- Original Message -----
> From: "Laurence Oberman" <loberman@redhat.com>
> To: "Leon Romanovsky" <leon@kernel.org>
> Cc: "Bart Van Assche" <Bart.VanAssche@sandisk.com>, hch@lst.de, maxg@mellanox.com, israelr@mellanox.com, linux-rdma@vger.kernel.org, dledford@redhat.com
> Sent: Monday, February 13, 2017 4:34:04 PM
> Subject: Re: [PATCH 8/8] IB/srp: Drain the send queue before destroying a QP
>
> On Monday, February 13, 2017 11:47:31 AM, Laurence Oberman <loberman@redhat.com> wrote:
> > On Monday, February 13, 2017 11:12:55 AM, Laurence Oberman <loberman@redhat.com> wrote:
> > > On Monday, February 13, 2017 9:24:01 AM, Laurence Oberman <loberman@redhat.com> wrote:
> > > > On Monday, February 13, 2017 9:17:24 AM, Leon Romanovsky <leon@kernel.org> wrote:
> > > > > On Mon, Feb 13, 2017 at 08:54:53AM -0500, Laurence Oberman wrote:
> > > > > > On Sunday, February 12, 2017 10:14:53 PM, Laurence Oberman <loberman@redhat.com> wrote:
> > > > > > > On Sunday, February 12, 2017 9:07:16 PM, Laurence Oberman <loberman@redhat.com> wrote:
> > > > > > > > On Sunday, February 12, 2017 3:05:16 PM, Bart Van Assche <Bart.VanAssche@sandisk.com> wrote:
> > > > > > > > > On Sun, 2017-02-12 at 13:02 -0500, Laurence Oberman wrote:
> > > > > > > > > > [ 861.143141] WARNING: CPU: 27 PID: 1103 at drivers/infiniband/core/verbs.c:1959 __ib_drain_sq+0x1bb/0x1c0 [ib_core]
> > > > > > > > > > [ 861.202208] IB_POLL_DIRECT poll_ctx not supported for drain
> > > > > > > > >
> > > > > > > > > Hello Laurence,
> > > > > > > > >
> > > > > > > > > That warning has been removed by patch 7/8 of this series. Please double check whether all eight patches have been applied properly.
> > > > > > > > >
> > > > > > > > > Bart.
> > > > > > > >
> > > > > > > > Hello,
> > > > > > > > Just a heads up: working with Bart on this patch series, we have stability issues with my tests in my MLX5 EDR-100 test bed.
> > > > > > > > Thanks
> > > > > > > > Laurence
> > > > > > >
> > > > > > > I went back to Linus' latest tree for a baseline and we fail the same way.
> > > > > > > This has none of the latest 8 patches applied, so we will have to figure out what broke this.
> > > > > > >
> > > > > > > Don't forget that I tested all this recently with Bart's dma patch series, and it's solid.
> > > > > > >
> > > > > > > Will come back to this tomorrow and see what recently made it into Linus's tree by checking back with Doug.
> > > > > > > [ 183.779175] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff880bd4270eb0
> > > > > > > [ 183.853047] 00000000 00000000 00000000 00000000
> > > > > > > [ 183.878425] 00000000 00000000 00000000 00000000
> > > > > > > [ 183.903243] 00000000 00000000 00000000 00000000
> > > > > > > [ 183.928518] 00000000 0f007806 2500002a ad9fafd1
> > > > > > > [ 198.538593] scsi host1: ib_srp: reconnect succeeded
> > > > > > > [ 198.573141] mlx5_0:dump_cqe:262:(pid 7369): dump error cqe
> > > > > > > [ 198.603037] 00000000 00000000 00000000 00000000
> > > > > > > [ 198.628884] 00000000 00000000 00000000 00000000
> > > > > > > [ 198.653961] 00000000 00000000 00000000 00000000
> > > > > > > [ 198.680021] 00000000 0f007806 25000032 00105dd0
> > > > > > > [ 198.705985] scsi host1: ib_srp: failed FAST REG status memory management operation error (6) for CQE ffff880b92860138
> > > > > > > [ 213.532848] scsi host1: ib_srp: reconnect succeeded
> > > > > > > [ 213.568828] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30
> > > > > > > [ 227.579684] scsi host1: ib_srp: reconnect succeeded
> > > > > > > [ 227.616175] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30
> > > > > > > [ 242.633925] scsi host1: ib_srp: reconnect succeeded
> > > > > > > [ 242.668160] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30
> > > > > > > [ 257.127715] scsi host1: ib_srp: reconnect succeeded
> > > > > > > [ 257.165623] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30
> > > > > > > [ 272.225762] scsi host1: ib_srp: reconnect succeeded
> > > > > > > [ 272.262570] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30
> > > > > > > [ 286.350226] scsi host1: ib_srp: reconnect succeeded
> > > > > > > [ 286.386160] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30
> > > > > > > [ 301.109365] scsi host1: ib_srp: reconnect succeeded
> > > > > > > [ 301.144930] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30
> > > > > > > [ 315.910860] scsi host1: ib_srp: reconnect succeeded
> > > > > > > [ 315.944594] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30
> > > > > > > [ 330.551052] scsi host1: ib_srp: reconnect succeeded
> > > > > > > [ 330.584552] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30
> > > > > > > [ 344.998448] scsi host1: ib_srp: reconnect succeeded
> > > > > > > [ 345.032115] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30
> > > > > > > [ 359.866731] scsi host1: ib_srp: reconnect succeeded
> > > > > > > [ 359.902114] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30
> > > > > > > ..
> > > > > > > [ 373.113045] scsi host1: ib_srp: reconnect succeeded
> > > > > > > [ 373.149511] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30
> > > > > > > [ 388.401469] fast_io_fail_tmo expired for SRP port-1:1 / host1.
> > > > > > > [ 388.589517] scsi host1: ib_srp: reconnect succeeded
> > > > > > > [ 388.623462] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30
> > > > > > > [ 403.086893] scsi host1: ib_srp: reconnect succeeded
> > > > > > > [ 403.120876] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f2234c30
> > > > > > > [ 403.140401] mlx5_0:dump_cqe:262:(pid 749): dump error cqe
> > > > > > > [ 403.140402] 00000000 00000000 00000000 00000000
> > > > > > > [ 403.140402] 00000000 00000000 00000000 00000000
> > > > > > > [ 403.140403] 00000000 00000000 00000000 00000000
> > > > > > > [ 403.140403] 00
> > > > > >
> > > > > > Hello,
> > > > > >
> > > > > > Let me summarize where we are and how we got here.
> > > > > >
> > > > > > The last kernel I tested with mlx5 and ib_srp was vmlinuz-4.10.0-rc4 with Bart's dma patches. All tests passed.
> > > > > >
> > > > > > I pulled Linus's tree and applied all 8 patches of the above series and we failed in the "failed FAST REG status memory management" area.
> > > > > >
> > > > > > I applied only 7 of the 8 patches to Linus's tree because Bart and I thought patch 6 of the series may have been the catalyst. This also failed.
> > > > > >
> > > > > > Building from Bart's tree, which is based on 4.10.0-rc7, failed again.
> > > > > >
> > > > > > This made me decide to baseline Linus's tree at 4.10.0-rc7, and we fail.
> > > > > >
> > > > > > So something has crept into 4.10.0-rc7 affecting this with mlx5 and ib_srp.
> > > > > From the InfiniBand side:
> > > > > ➜  linux-rdma git:(queue-next) git log v4.10-rc4...v4.10-rc7 -- drivers/inifiniband |wc
> > > > >       0       0       0
> > > > >
> > > > > From the eth side, nothing suspicious either:
> > > > > ➜  linux-rdma git:(queue-next) git l v4.10-rc4...v4.10-rc7 -- drivers/net/ethernet/mellanox/mlx5
> > > > > d15118af2683 net/mlx5e: Check ets capability before ets query FW command
> > > > > a100ff3eef19 net/mlx5e: Fix update of hash function/key via ethtool
> > > > > 1d3398facd08 net/mlx5e: Modify TIRs hash only when it's needed
> > > > > 3e621b19b0bb net/mlx5e: Support TC encapsulation offloads with upper devices
> > > > > 5bae8c031053 net/mlx5: E-Switch, Re-enable RoCE on mode change only after FDB destroy
> > > > > 5403dc703ff2 net/mlx5: E-Switch, Err when retrieving steering name-space fails
> > > > > eff596da4878 net/mlx5: Return EOPNOTSUPP when failing to get steering name-space
> > > > > 9eb7892351a3 net/mlx5: Change ENOTSUPP to EOPNOTSUPP
> > > > > e048fc50d7bd net/mlx5e: Do not recycle pages from emergency reserve
> > > > > ad05df399f33 net/mlx5e: Remove unused variable
> > > > > 639e9e94160e net/mlx5e: Remove unnecessary checks when setting num channels
> > > > > abeffce90c7f net/mlx5e: Fix a -Wmaybe-uninitialized warning
> > > > >
> > > > > > Thanks
> > > > > > Laurence
> > > >
> > > > Hi Leon,
> > > >
> > > > Yep, I also looked for outliers here that may look suspicious and did not see any.
> > > >
> > > > I guess I will have to start bisecting. I will start with rc5; if that fails I will bisect between rc4 and rc5, as we know rc4 was fine.
> > > >
> > > > I did re-run the tests on rc4 last night and it was stable.
> > > >
> > > > Thanks
> > > > Laurence
> > >
> > > OK, so 4.10.0-rc5 is fine and 4.10.0-rc6 fails, so I will start bisecting, unless one of you thinks you know what may be causing this in rc6.
> > > This will take time, so I will come back to the list once I have it isolated.
> > >
> > > Thanks
> > > Laurence
> >
> > Bisect has 8 possible kernel builds, 200+ changes; started the first one.
> >
> > Thanks
> > Laurence
>
> Hello,
>
> Bisecting got me to this commit. I had reviewed this looking for an explanation at some point.
> At the time, I did not understand the need for the change, but after explanation I accepted it.
> I reverted this and we are good again, but reading the code I am not seeing how this is affecting us.
>
> Makes no sense how this can be the issue.
>
> Nevertheless, we will need to revert this, please.
>
> I will now apply the 8 patches from Bart to Linus's tree with this reverted and test again.
> Bisect run:
>
> git bisect start
> git bisect bad 566cf877a1fcb6d6dc0126b076aad062054c2637
> git bisect good 7a308bb3016f57e5be11a677d15b821536419d36
> git bisect good
> git bisect good
> git bisect bad
> git bisect bad
> git bisect bad
> git bisect bad
> git bisect good
>
> Bisecting: 0 revisions left to test after this (roughly 1 step)
> [0a475ef4226e305bdcffe12b401ca1eab06c4913] IB/srp: fix invalid indirect_sg_entries parameter value
>
> [loberman@ibclient linux-torvalds]$ git show 0a475ef4226e305bdcffe12b401ca1eab06c4913
> commit 0a475ef4226e305bdcffe12b401ca1eab06c4913
> Author: Israel Rukshin <israelr@mellanox.com>
> Date:   Wed Jan 4 15:59:37 2017 +0200
>
>     IB/srp: fix invalid indirect_sg_entries parameter value
>
>     After setting indirect_sg_entries module_param to huge value (e.g 500,000),
>     srp_alloc_req_data() fails to allocate indirect descriptors for the request
>     ring (kmalloc fails). This commit enforces the maximum value of
>     indirect_sg_entries to be SG_MAX_SEGMENTS as signified in module param
>     description.
>
>     Fixes: 65e8617fba17 (scsi: rename SCSI_MAX_{SG, SG_CHAIN}_SEGMENTS)
>     Fixes: c07d424d6118 (IB/srp: add support for indirect tables that don't fit in SRP_CMD)
>     Cc: stable@vger.kernel.org # 4.7+
>     Signed-off-by: Israel Rukshin <israelr@mellanox.com>
>     Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
>     Reviewed-by: Laurence Oberman <loberman@redhat.com>
>     Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
>     Signed-off-by: Doug Ledford <dledford@redhat.com>
>
> diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
> index 0f67cf9..79bf484 100644
> --- a/drivers/infiniband/ulp/srp/ib_srp.c
> +++ b/drivers/infiniband/ulp/srp/ib_srp.c
> @@ -3699,6 +3699,12 @@ static int __init srp_init_module(void)
>  		indirect_sg_entries = cmd_sg_entries;
>  	}
>
> +	if (indirect_sg_entries > SG_MAX_SEGMENTS) {
> +		pr_warn("Clamping indirect_sg_entries to %u\n",
> +			SG_MAX_SEGMENTS);
> +		indirect_sg_entries = SG_MAX_SEGMENTS;
> +	}
> +
>  	srp_remove_wq = create_workqueue("srp_remove");
>  	if (!srp_remove_wq) {
>  		ret = -ENOMEM;

Hello,

The revert actually does not help; it failed after a while.

This mail was in my drafts while I was testing, and it got sent when it should not have been.

The revert does not help, which I am happy about because it made no sense. So I am not sure how the bisect got me here, but it did.

I will have to run through this again and see where the bisect went wrong.

Thanks
Laurence
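A quick way to double-check a bisect hit like the one above is to revert the suspect commit on top of the failing baseline and rerun the reproducer, which is roughly what is described in this message. A minimal sketch, assuming a standard kernel build and install flow on the test host:

# Sanity-check the bisect result: revert the suspect commit on the failing
# tree, rebuild, and rerun the srp/mlx5 test.
git checkout v4.10-rc7
git revert 0a475ef4226e305bdcffe12b401ca1eab06c4913
make -j"$(nproc)"
sudo make modules_install install
# Reboot into the new kernel, rerun the reproducer, and compare against the
# unreverted v4.10-rc7 result.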
On Mon, 2017-02-13 at 16:46 -0500, Laurence Oberman wrote:
> I will have to run through this again and see where the bisect went wrong.
Hello Laurence,
If you are considering repeating the bisect: did you know that a bisect can
be sped up by specifying the names of the files and/or directories that are
suspected? An example:
git bisect start */infiniband */net
Bart.
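A path-limited bisect along the lines Bart suggests might look roughly like this. This is only a sketch: the bad and good revisions are the ones from the bisect log earlier in the thread, and the path arguments assume the regression lives somewhere under drivers/infiniband or drivers/net:

# Limit the bisect to commits that touch the suspected directories so that
# unrelated revisions are skipped automatically.
git bisect start -- drivers/infiniband drivers/net
git bisect bad  566cf877a1fcb6d6dc0126b076aad062054c2637   # revision that fails
git bisect good 7a308bb3016f57e5be11a677d15b821536419d36   # revision that passes
# Build, boot, and test the checked-out revision, then record the verdict:
git bisect good     # or: git bisect bad
# Repeat until git names the first bad commit, then clean up:
git bisect reset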
----- Original Message -----
> From: "Bart Van Assche" <Bart.VanAssche@sandisk.com>
> To: leon@kernel.org, loberman@redhat.com
> Cc: hch@lst.de, maxg@mellanox.com, israelr@mellanox.com, linux-rdma@vger.kernel.org, dledford@redhat.com
> Sent: Monday, February 13, 2017 4:52:28 PM
> Subject: Re: v4.10-rc SRP + mlx5 regression
>
> On Mon, 2017-02-13 at 16:46 -0500, Laurence Oberman wrote:
> > I will have to run through this again and see where the bisect went wrong.
>
> Hello Laurence,
>
> If you are considering repeating the bisect: did you know that a bisect can
> be sped up by specifying the names of the files and/or directories that are
> suspected? An example:
>
> git bisect start */infiniband */net
>
> Bart.

Hello Bart,

I will try that. I knew it was possible but had not used it before, so I wanted to be careful. Even being careful, something went wrong :)

I was very careful and I waited in between tests to give each one long enough. Perhaps I said good when I meant bad, or something like that.

I will use your method, and by tomorrow I should have this figured out for you.

Thanks
Laurence
diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
index 0f67cf9..79bf484 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.c
+++ b/drivers/infiniband/ulp/srp/ib_srp.c
@@ -3699,6 +3699,12 @@ static int __init srp_init_module(void)
 		indirect_sg_entries = cmd_sg_entries;
 	}
 
+	if (indirect_sg_entries > SG_MAX_SEGMENTS) {
+		pr_warn("Clamping indirect_sg_entries to %u\n",
+			SG_MAX_SEGMENTS);
+		indirect_sg_entries = SG_MAX_SEGMENTS;
+	}
+
 	srp_remove_wq = create_workqueue("srp_remove");
 	if (!srp_remove_wq) {
 		ret = -ENOMEM;
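A rough way to exercise the clamp added by this hunk is sketched below. It assumes ib_srp is built as a module on an expendable test host, is not currently in use, and exposes indirect_sg_entries read-only under /sys/module:

# Reload ib_srp with an oversized indirect_sg_entries and confirm the value
# is clamped up front instead of srp_alloc_req_data() failing later.
sudo modprobe -r ib_srp
sudo modprobe ib_srp indirect_sg_entries=500000
dmesg | grep -i 'Clamping indirect_sg_entries'
cat /sys/module/ib_srp/parameters/indirect_sg_entries   # expect the SG_MAX_SEGMENTS value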