From patchwork Wed May 10 14:06:47 2017
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Laurence Oberman
X-Patchwork-Id: 9719917
Date: Wed, 10 May 2017 10:06:47 -0400 (EDT)
From: Laurence Oberman
To: Sagi Grimberg
Cc: Leon Romanovsky, Bart Van Assche, Doug Ledford, Max Gurtovoy,
 Israel Rukshin, linux-rdma@vger.kernel.org
Message-ID: <1415936724.7101967.1494425207538.JavaMail.zimbra@redhat.com>
In-Reply-To: <1072634318.5542006.1494001866306.JavaMail.zimbra@redhat.com>
References: <8992bd28-667f-94b1-e582-106e6b41aa4b@sandisk.com>
 <20170425175849.GS14088@mtr-leonro.local>
 <438230391.2090966.1493152655709.JavaMail.zimbra@redhat.com>
 <20170426061640.GV14088@mtr-leonro.local>
 <501334895.4531615.1493820950718.JavaMail.zimbra@redhat.com>
 <374fcc74-4b84-610b-b55e-d385563bef6f@grimberg.me>
 <1072634318.5542006.1494001866306.JavaMail.zimbra@redhat.com>
Subject: Re: [PATCH, untested] mlx5: Avoid that mlx5_ib_sg_to_klms() overflows
 the klms[] array
X-Mailing-List: linux-rdma@vger.kernel.org

----- Original Message -----
> From: "Laurence Oberman"
> To: "Sagi Grimberg"
> Cc: "Leon Romanovsky", "Bart Van Assche", "Doug Ledford",
> "Max Gurtovoy", "Israel Rukshin", linux-rdma@vger.kernel.org
> Sent: Friday, May 5, 2017 12:31:06 PM
> Subject: Re: [PATCH, untested] mlx5: Avoid that mlx5_ib_sg_to_klms() overflows the klms[] array
>
>
>
> ----- Original Message -----
> > From: "Sagi Grimberg"
> > To: "Laurence Oberman"
> > Cc: "Leon Romanovsky", "Bart Van Assche", "Doug Ledford",
> > "Max Gurtovoy", "Israel Rukshin", linux-rdma@vger.kernel.org
> > Sent: Wednesday, May 3, 2017 10:58:43 AM
> > Subject: Re: [PATCH, untested] mlx5: Avoid that mlx5_ib_sg_to_klms()
> > overflows the klms[] array
> >
> >
> > > Hello Sagi
> > > Against Bart's tree again
> > >
> > > a83e404 IB/srp: Reenable IB_MR_TYPE_SG_GAPS
> > > dfa5a2b mlx5: Avoid that mlx5_ib_sg_to_klms() overflows the klms[] array
> > > f759c80 mlx5: Fix mlx5_ib_map_mr_sg mr lengt
> > >
> > > Above are all in
> > > Added your most recent patch above
> > >
> > > Same behavior.
> > > [ 579.368733] scsi host1: ib_srp: failed RECV status WR flushed (5) for
> > > CQE ffff8817de9c57b0
> > > [ 579.369875] mlx5_1:dump_cqe:262:(pid 15140): dump error cqe
> > > [ 579.369877] 00000000 00000000 00000000 00000000
> > > [ 579.369877] 00000000 00000000 00000000 00000000
> > > [ 579.369878] 00000000 00000000 00000000 00000000
> > > [ 579.369878] 00000000 0f007806 2500002b 1c528dd0
> > > [ 579.369883] scsi host1: ib_srp: failed FAST REG status memory
> > > management
> > > operation error (6) for CQE ffff88179a460af8
> > > [ 594.814222] scsi host1: ib_srp: reconnect succeeded
> > > [ 594.916876] scsi host1: ib_srp: failed RECV status WR flushed (5) for
> > > CQE ffff8817e1d4a6b0
> > > [ 595.494532] mlx5_1:dump_cqe:262:(pid 15205): dump error cqe
> > > [ 595.525995] 00000000 00000000 00000000 00000000
> > > [ 595.552125] 00000000 00000000 00000000 00000000
> > > [ 595.578204] 00000000 00000000 00000000 00000000
> > > [ 595.603670] 00000000 0f007806 25000033 002d77d0
> > > ^C[ 610.821911] scsi host1: ib_srp: reconnect succeeded
> > > [ 610.933298] scsi host1: ib_srp: failed RECV status WR flushed (5) for
> > > CQE ffff8817e1d4a170
> > > [ 611.514234] mlx5_1:dump_cqe:262:(pid 15242): dump error cqe
> > > [ 611.543083] 00000000 00000000 00000000 00000000
> > > [ 611.568670] 00000000 00000000 00000000 00000000
> > > [ 611.594064] 00000000 00000000 00000000 00000000
> > > [ 611.620142] 00000000 0f007806 2500003b 003161d0
> > >
> > > I will capture the function traces with your patch applied and the
> > > additional logging asked for by Max.
> >
> > Thanks, that would be helpful,
> >
> > Can you try the following patch, just to see if there is an off by 1 case:
> >
> > --
> > diff --git a/drivers/infiniband/hw/mlx5/mr.c
> > b/drivers/infiniband/hw/mlx5/mr.c
> > index b8f9382a8b7d..3d6ef7bce7d9 100644
> > --- a/drivers/infiniband/hw/mlx5/mr.c
> > +++ b/drivers/infiniband/hw/mlx5/mr.c
> > @@ -1525,7 +1525,7 @@ struct ib_mr *mlx5_ib_alloc_mr(struct ib_pd *pd,
> >  {
> >  	struct mlx5_ib_dev *dev = to_mdev(pd->device);
> >  	int inlen = MLX5_ST_SZ_BYTES(create_mkey_in);
> > -	int ndescs = ALIGN(max_num_sg, 4);
> > +	int ndescs = ALIGN(max_num_sg + 1, 4);
> >  	struct mlx5_ib_mr *mr;
> >  	void *mkc;
> >  	u32 *in;
> > --
> >
> > It's not a fix, but if it works it can give us a clue...
> >
>
> Sorry, been delayed this week, will get this done this weekend.
> Thanks
> Laurence
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>

Sagi, Max

With the patch below against Bart's tree we still see the cqe_dump issue.
Is what is in the patch everything you wanted applied?
Please check I did not miss anything before I start the tracing.

May 9 17:16:00 localhost kernel: scsi host2: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817ed75c830
May 9 17:16:00 localhost kernel: mlx5_1:dump_cqe:262:(pid 14567): dump error cqe
May 9 17:16:00 localhost kernel: 00000000 00000000 00000000 00000000
May 9 17:16:00 localhost kernel: 00000000 00000000 00000000 00000000
May 9 17:16:00 localhost kernel: 00000000 00000000 00000000 00000000
May 9 17:16:00 localhost kernel: 00000000 0f007806 2500002a 0b670bd0
May 9 17:16:00 localhost kernel: scsi host2: ib_srp: failed FAST REG status memory management operation error (6) for CQE ffff8817972ac278
May 9 17:16:16 localhost kernel: scsi host2: ib_srp: reconnect succeeded
May 9 17:16:16 localhost kernel: scsi host2: ib_srp: failed RECV status WR flushed (5) for CQE ffff8817d819b130

Thanks
Laurence
---
diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
index 99beacf..cf899b4 100644
--- a/drivers/infiniband/hw/mlx5/mr.c
+++ b/drivers/infiniband/hw/mlx5/mr.c
@@ -1525,7 +1525,8 @@ struct ib_mr *mlx5_ib_alloc_mr(struct ib_pd *pd,
 {
 	struct mlx5_ib_dev *dev = to_mdev(pd->device);
 	int inlen = MLX5_ST_SZ_BYTES(create_mkey_in);
-	int ndescs = ALIGN(max_num_sg, 4);
+	//int ndescs = ALIGN(max_num_sg, 4);
+	int ndescs = ALIGN(max_num_sg + 1, 4);
 	struct mlx5_ib_mr *mr;
 	void *mkc;
 	u32 *in;
diff --git a/drivers/infiniband/hw/mlx5/qp.c b/drivers/infiniband/hw/mlx5/qp.c
index ad8a263..cb726a5 100644
--- a/drivers/infiniband/hw/mlx5/qp.c
+++ b/drivers/infiniband/hw/mlx5/qp.c
@@ -3224,22 +3224,19 @@ static void set_reg_mkey_seg(struct mlx5_mkey_seg *seg,
 			     struct mlx5_ib_mr *mr, u32 key, int access)
 {
-	int ndescs = ALIGN(mr->ndescs, 8) >> 1;
+	int size = mr->ndescs * mr->desc_size;
 
 	memset(seg, 0, sizeof(*seg));
 
 	if (mr->access_mode == MLX5_MKC_ACCESS_MODE_MTT)
 		seg->log2_page_size = ilog2(mr->ibmr.page_size);
-	else if (mr->access_mode == MLX5_MKC_ACCESS_MODE_KLMS)
-		/* KLMs take twice the size of MTTs */
-		ndescs *= 2;
 
 	seg->flags = get_umr_flags(access) | mr->access_mode;
 	seg->qpn_mkey7_0 = cpu_to_be32((key & 0xff) | 0xffffff00);
 	seg->flags_pd = cpu_to_be32(MLX5_MKEY_REMOTE_INVAL);
 	seg->start_addr = cpu_to_be64(mr->ibmr.iova);
 	seg->len = cpu_to_be64(mr->ibmr.length);
-	seg->xlt_oct_size = cpu_to_be32(ndescs);
+	seg->xlt_oct_size = cpu_to_be32(get_xlt_octo(size));
 }

I will see about capturing traces, but I am writing to a RAM disk on the target, so I will likely have a flood of trace data.
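
For anyone checking the arithmetic in the two hunks above, here is a small user-space sketch of the descriptor counts involved. It is not driver code and rests on several assumptions that this thread does not confirm: MTT and KLM descriptors are taken to be 8 and 16 bytes, and get_xlt_octo() is assumed to round the byte count up to 64 and divide by 16. The helper names (align_up, alloc_ndescs, old_xlt_octo, new_xlt_octo) are made up for illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Assumed sizes, not taken from this thread. */
enum { MTT_DESC_SIZE = 8, KLM_DESC_SIZE = 16 };  /* bytes per descriptor */
enum { XLT_ALIGNMENT = 64, OCTOWORD = 16 };      /* assumed get_xlt_octo() constants */

/* Round x up to a multiple of a (a must be a power of two), like the kernel's ALIGN(). */
static unsigned long align_up(unsigned long x, unsigned long a)
{
	return (x + a - 1) & ~(a - 1);
}

/* Descriptor count in mlx5_ib_alloc_mr(): extra = 0 is the original line, extra = 1 the test patch. */
static unsigned long alloc_ndescs(unsigned long max_num_sg, unsigned long extra)
{
	return align_up(max_num_sg + extra, 4);
}

/* xlt_oct_size as computed by the lines the qp.c hunk removes. */
static unsigned long old_xlt_octo(unsigned long ndescs, int is_klm)
{
	unsigned long octo = align_up(ndescs, 8) >> 1;

	if (is_klm)
		octo *= 2;	/* "KLMs take twice the size of MTTs" */
	return octo;
}

/* xlt_oct_size via get_xlt_octo(ndescs * desc_size), under the assumption stated above. */
static unsigned long new_xlt_octo(unsigned long ndescs, unsigned long desc_size)
{
	return align_up(ndescs * desc_size, XLT_ALIGNMENT) / OCTOWORD;
}

/* Print old and new values side by side for a few list lengths. */
static void print_xlt_table(void)
{
	unsigned long n;

	for (n = 1; n <= 9; n++)
		printf("n=%lu alloc=%lu/%lu mtt=%lu/%lu klm=%lu/%lu\n", n,
		       alloc_ndescs(n, 0), alloc_ndescs(n, 1),
		       old_xlt_octo(n, 0), new_xlt_octo(n, MTT_DESC_SIZE),
		       old_xlt_octo(n, 1), new_xlt_octo(n, KLM_DESC_SIZE));
}
```

Calling print_xlt_table() from a main() prints each value as old/new. If the assumptions hold, the two formulas agree for MTTs, while for KLMs the removed code rounds the count up to a multiple of 8 and the new code only to a multiple of 4, so the programmed xlt_oct_size can shrink with the qp.c change. The mr.c change only adds headroom when max_num_sg is already a multiple of 4.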