From patchwork Tue Feb 14 02:19:54 2017
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Laurence Oberman
X-Patchwork-Id: 9571019
Date: Mon, 13 Feb 2017 21:19:54 -0500 (EST)
From: Laurence Oberman
To: Bart Van Assche
Cc: leon@kernel.org, hch@lst.de, maxg@mellanox.com, israelr@mellanox.com,
 linux-rdma@vger.kernel.org, dledford@redhat.com
Message-ID: <568916592.30910570.1487038794766.JavaMail.zimbra@redhat.com>
In-Reply-To: <1487022735.2719.7.camel@sandisk.com>
References: <20170210235611.3243-1-bart.vanassche@sandisk.com>
 <20170213141724.GQ14015@mtr-leonro.local>
 <225897984.30545262.1486995841880.JavaMail.zimbra@redhat.com>
 <1971987443.30613645.1487002375580.JavaMail.zimbra@redhat.com>
 <21338434.30712464.1487004451595.JavaMail.zimbra@redhat.com>
 <1301607843.30852658.1487021644535.JavaMail.zimbra@redhat.com>
 <898197116.30855343.1487022400065.JavaMail.zimbra@redhat.com>
 <1487022735.2719.7.camel@sandisk.com>
Subject: Re: v4.10-rc SRP + mlx5 regression
Sender: linux-rdma-owner@vger.kernel.org
Precedence: bulk
X-Mailing-List: linux-rdma@vger.kernel.org

----- Original Message -----
> From: "Bart Van Assche"
> To: leon@kernel.org, loberman@redhat.com
> Cc: hch@lst.de, maxg@mellanox.com, israelr@mellanox.com, linux-rdma@vger.kernel.org, dledford@redhat.com
> Sent: Monday, February 13, 2017 4:52:28 PM
> Subject: Re: v4.10-rc SRP + mlx5 regression
>
> On Mon, 2017-02-13 at 16:46 -0500, Laurence Oberman wrote:
> > I will have to run through this again and see where the bisect went wrong.
>
> Hello Laurence,
>
> If you would be considering to repeat the bisect, did you know that a bisect
> can be sped up by specifying the names of the files and/or directories that
> are suspected? An example:
>
> git bisect start */infiniband */net
>
> Bart.--
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>

Hello Bart,

Much better news this time :), worked late on this but got it figured out.

OK, so we got to this one, which makes a lot more sense and is right in the
area where we are having issues.
I must have answered one of the steps wrong the first time I did the bisect.

I reverted this commit in the master tree at rc8 and rebuilt the kernel.
Now all tests pass on Linus's tree - 4.10.0_rc8+

The interesting point here is that this commit is in rc5, but rc5 was not
failing, so we have an interoperability issue with this commit.

[loberman@ibclient linux]$ git bisect good
Bisecting: 0 revisions left to test after this (roughly 1 step)
[ad8e66b4a80182174f73487ed25fd2140cf43361] IB/srp: fix mr allocation when the device supports sg gaps

[loberman@ibclient linux]$ git show ad8e66b4a80182174f73487ed25fd2140cf43361
commit ad8e66b4a80182174f73487ed25fd2140cf43361
Author: Israel Rukshin
Date:   Wed Dec 28 12:48:28 2016 +0200

    IB/srp: fix mr allocation when the device supports sg gaps

    If the device support arbitrary sg list mapping (device cap
    IB_DEVICE_SG_GAPS_REG set) we allocate the memory regions with
    IB_MR_TYPE_SG_GAPS.

    Fixes: 509c5f33f4f6 ("IB/srp: Prevent mapping failures")
    Cc: # 4.7+
    Signed-off-by: Israel Rukshin
    Signed-off-by: Max Gurtovoy
    Reviewed-by: Leon Romanovsky
    Reviewed-by: Mark Bloch
    Reviewed-by: Yuval Shaia
    Reviewed-by: Bart Van Assche
    Signed-off-by: Doug Ledford

Now moving on to what got me here in the first place.

Bart, let me know if 7 of the 8 patches in your most recent series are all
still valid after this revert. Otherwise let me know which ones you want me
to apply.

Patch 6, I am thinking, is no longer valid.

"
If a HCA supports the SG_GAPS_REG feature then a single memory region
of type IB_MR_TYPE_SG_GAPS is sufficient. This patch reduces the number
of memory regions that is allocated per SRP session.
"

Thanks
Laurence
---
diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
index 8ddc071..0f67cf9 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.c
+++ b/drivers/infiniband/ulp/srp/ib_srp.c
@@ -371,6 +371,7 @@ static struct srp_fr_pool *srp_create_fr_pool(struct ib_device *device,
 	struct srp_fr_desc *d;
 	struct ib_mr *mr;
 	int i, ret = -EINVAL;
+	enum ib_mr_type mr_type;
 
 	if (pool_size <= 0)
 		goto err;
@@ -384,9 +385,13 @@ static struct srp_fr_pool *srp_create_fr_pool(struct ib_device *device,
 	spin_lock_init(&pool->lock);
 	INIT_LIST_HEAD(&pool->free_list);
 
+	if (device->attrs.device_cap_flags & IB_DEVICE_SG_GAPS_REG)
+		mr_type = IB_MR_TYPE_SG_GAPS;
+	else
+		mr_type = IB_MR_TYPE_MEM_REG;
+
 	for (i = 0, d = &pool->desc[0]; i < pool->size; i++, d++) {
-		mr = ib_alloc_mr(pd, IB_MR_TYPE_MEM_REG,
-				 max_page_list_len);
+		mr = ib_alloc_mr(pd, mr_type, max_page_list_len);
 		if (IS_ERR(mr)) {
 			ret = PTR_ERR(mr);
 			if (ret == -ENOMEM)
(END)

So here is the revert patch, but you need to decide how you want to deal with
this.
Revert "IB/srp: fix mr allocation when the device supports sg gaps"

Laurence Oberman
Traced by bisection as a cause of this failure

Tested-by: Laurence Oberman
Signed-off-by: Laurence Oberman

commit 90d169d312a173d5350c1bb36d6daab04c592127
Author: Laurence Oberman
Date:   Mon Feb 13 20:33:32 2017 -0500

    Revert "IB/srp: fix mr allocation when the device supports sg gaps"

    Laurence Oberman
    Traced by bisection as a cause of this failure

    [  130.437603] mlx5_0:dump_cqe:262:(pid 3812): dump error cqe
    [  130.437682] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f0edbfb0
    [  130.510899] 00000000 00000000 00000000 00000000
    [  130.536455] 00000000 00000000 00000000 00000000
    [  130.561878] 00000000 00000000 00000000 00000000
    [  130.585904] 00000000 0f007806 2500002a db0ec4d0
    [  145.842925] fast_io_fail_tmo expired for SRP port-1:1 / host1.
    [  146.530439] scsi host1: ib_srp: reconnect succeeded
    [  146.566629] mlx5_0:dump_cqe:262:(pid 3293): dump error cqe
    [  146.597635] 00000000 00000000 00000000 00000000
    [  146.623545] 00000000 00000000 00000000 00000000
    [  146.649599] 00000000 00000000 00000000 00000000
    [  146.673938] 00000000 0f007806 25000032 000c46d0
    [  146.697969] scsi host1: ib_srp: failed FAST REG status memory management operation error (6) for CQE ffff88
    [  162.225247] fast_io_fail_tmo expired for SRP port-1:1 / host1.
    [  162.256337] scsi host1: ib_srp: reconnect succeeded
    [  162.293396] scsi host1: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817f0412ef0

    This reverts commit ad8e66b4a80182174f73487ed25fd2140cf43361.

diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
index 79bf484..01338c8 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.c
+++ b/drivers/infiniband/ulp/srp/ib_srp.c
@@ -371,7 +371,6 @@ static struct srp_fr_pool *srp_create_fr_pool(struct ib_device *device,
 	struct srp_fr_desc *d;
 	struct ib_mr *mr;
 	int i, ret = -EINVAL;
-	enum ib_mr_type mr_type;
 
 	if (pool_size <= 0)
 		goto err;
@@ -385,13 +384,9 @@ static struct srp_fr_pool *srp_create_fr_pool(struct ib_device *device,
 	spin_lock_init(&pool->lock);
 	INIT_LIST_HEAD(&pool->free_list);
 
-	if (device->attrs.device_cap_flags & IB_DEVICE_SG_GAPS_REG)
-		mr_type = IB_MR_TYPE_SG_GAPS;
-	else
-		mr_type = IB_MR_TYPE_MEM_REG;
-
 	for (i = 0, d = &pool->desc[0]; i < pool->size; i++, d++) {
-		mr = ib_alloc_mr(pd, mr_type, max_page_list_len);
+		mr = ib_alloc_mr(pd, IB_MR_TYPE_MEM_REG,
+				 max_page_list_len);
 		if (IS_ERR(mr)) {
 			ret = PTR_ERR(mr);
 			if (ret == -ENOMEM)
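
For anyone following the thread without the tree at hand, the difference between the two behaviours in srp_create_fr_pool() can be sketched in user space roughly as below. This is only an illustration and not kernel code: the sketch_* names and the flag value are hypothetical stand-ins chosen for this example, and only the selection logic is taken from the diffs in this mail.

```c
#include <stdint.h>

/*
 * Hypothetical user-space stand-ins for the kernel definitions.
 * The names mirror the patch, but the numeric values are illustrative
 * assumptions, not the values used in the kernel headers.
 */
enum sketch_mr_type {
	SKETCH_MR_TYPE_MEM_REG,
	SKETCH_MR_TYPE_SG_GAPS,
};

#define SKETCH_DEVICE_SG_GAPS_REG (1ULL << 0)

/* Behaviour with commit ad8e66b4a801 applied: follow the device capability. */
enum sketch_mr_type sketch_mr_type_with_commit(uint64_t device_cap_flags)
{
	if (device_cap_flags & SKETCH_DEVICE_SG_GAPS_REG)
		return SKETCH_MR_TYPE_SG_GAPS;
	return SKETCH_MR_TYPE_MEM_REG;
}

/* Behaviour after the revert: always use the plain registration type. */
enum sketch_mr_type sketch_mr_type_after_revert(uint64_t device_cap_flags)
{
	(void)device_cap_flags; /* the capability is ignored after the revert */
	return SKETCH_MR_TYPE_MEM_REG;
}
```

On a device that advertises the capability, the first helper picks the SG_GAPS type while the second always falls back to MEM_REG, which is the only functional change the revert makes.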