From patchwork Wed Jun 17 12:32:43 2015
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: "Marciniszyn, Mike"
X-Patchwork-Id: 6625261
Subject: [PATCH v3 46/49] IB/hfi1: add general verbs handling
To: dledford@redhat.com
From: Mike Marciniszyn
Cc: linux-rdma@vger.kernel.org
Date: Wed, 17 Jun 2015 08:32:43 -0400
Message-ID: <20150617123243.8744.91659.stgit@phlsvslse11.ph.intel.com>
In-Reply-To: <20150617122755.8744.44665.stgit@phlsvslse11.ph.intel.com>
References: <20150617122755.8744.44665.stgit@phlsvslse11.ph.intel.com>
User-Agent: StGit/0.16

Signed-off-by: Andrew Friedley
Signed-off-by: Arthur Kepner
Signed-off-by: Brendan Cunningham
Signed-off-by: Brian Welty
Signed-off-by: Caz Yokoyama
Signed-off-by: Dean Luick
Signed-off-by: Dennis Dalessandro
Signed-off-by: Easwar Hariharan
Signed-off-by: Harish Chegondi
Signed-off-by: Ira Weiny
Signed-off-by: Jim Snow
Signed-off-by: John Gregor
Signed-off-by: Jubin John
Signed-off-by: Kaike Wan
Signed-off-by: Kevin Pine
Signed-off-by: Kyle Liddell
Signed-off-by: Mike Marciniszyn
Signed-off-by: Mitko Haralanov
Signed-off-by: Ravi Krishnaswamy
Signed-off-by: Sadanand Warrier
Signed-off-by: Sanath Kumar
Signed-off-by: Sudeep Dutt
Signed-off-by: Vlad Danushevsky
---
 drivers/infiniband/hw/hfi1/verbs.c | 2215 ++++++++++++++++++++++++++++++++++++
 drivers/infiniband/hw/hfi1/verbs.h | 1193 +++++++++++++++++++
 2 files changed, 3408 insertions(+)
 create mode 100644 drivers/infiniband/hw/hfi1/verbs.c
 create mode 100644 drivers/infiniband/hw/hfi1/verbs.h

diff --git a/drivers/infiniband/hw/hfi1/verbs.c b/drivers/infiniband/hw/hfi1/verbs.c
new file mode 100644
index 0000000..680fd41
--- /dev/null
+++ b/drivers/infiniband/hw/hfi1/verbs.c
@@ -0,0 +1,2215 @@
+/*
+ *
+ * This file is provided under a dual BSD/GPLv2 license. When using or
+ * redistributing this file, you may do so under either license.
+ *
+ * GPL LICENSE SUMMARY
+ *
+ * Copyright(c) 2015 Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ *
+ * BSD LICENSE
+ *
+ * Copyright(c) 2015 Intel Corporation.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ *  - Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ *  - Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in
+ *    the documentation and/or other materials provided with the
+ *    distribution.
+ *  - Neither the name of Intel Corporation nor the names of its
+ *    contributors may be used to endorse or promote products derived
+ *    from this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ */
+
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+#include "hfi.h"
+#include "common.h"
+#include "device.h"
+#include "trace.h"
+#include "qp.h"
+#include "sdma.h"
+
+unsigned int hfi1_lkey_table_size = 16;
+module_param_named(lkey_table_size, hfi1_lkey_table_size, uint,
+		   S_IRUGO);
+MODULE_PARM_DESC(lkey_table_size,
+		 "LKEY table size in bits (2^n, 1 <= n <= 23)");
+
+static unsigned int hfi1_max_pds = 0xFFFF;
+module_param_named(max_pds, hfi1_max_pds, uint, S_IRUGO);
+MODULE_PARM_DESC(max_pds,
+		 "Maximum number of protection domains to support");
+
+static unsigned int hfi1_max_ahs = 0xFFFF;
+module_param_named(max_ahs, hfi1_max_ahs, uint, S_IRUGO);
+MODULE_PARM_DESC(max_ahs, "Maximum number of address handles to support");
+
+unsigned int hfi1_max_cqes = 0x2FFFF;
+module_param_named(max_cqes, hfi1_max_cqes, uint, S_IRUGO);
+MODULE_PARM_DESC(max_cqes,
+		 "Maximum number of completion queue entries to support");
+
+unsigned int hfi1_max_cqs = 0x1FFFF;
+module_param_named(max_cqs, hfi1_max_cqs, uint, S_IRUGO);
+MODULE_PARM_DESC(max_cqs, "Maximum number of completion queues to support");
+
+unsigned int hfi1_max_qp_wrs = 0x3FFF;
+module_param_named(max_qp_wrs, hfi1_max_qp_wrs, uint, S_IRUGO);
+MODULE_PARM_DESC(max_qp_wrs, "Maximum number of QP WRs to support");
+
+unsigned int hfi1_max_qps = 16384;
+module_param_named(max_qps, hfi1_max_qps, uint, S_IRUGO);
+MODULE_PARM_DESC(max_qps, "Maximum number of QPs to support");
+
+unsigned int hfi1_max_sges = 0x60;
+module_param_named(max_sges, hfi1_max_sges, uint, S_IRUGO);
+MODULE_PARM_DESC(max_sges, "Maximum number of SGEs to support");
+
+unsigned int hfi1_max_mcast_grps = 16384;
+module_param_named(max_mcast_grps, hfi1_max_mcast_grps, uint, S_IRUGO);
+MODULE_PARM_DESC(max_mcast_grps,
+		 "Maximum number of multicast groups to support");
+
+unsigned int hfi1_max_mcast_qp_attached = 16;
+module_param_named(max_mcast_qp_attached, hfi1_max_mcast_qp_attached,
+		   uint, S_IRUGO);
+MODULE_PARM_DESC(max_mcast_qp_attached,
+		 "Maximum number of attached QPs to support");
+
+unsigned int hfi1_max_srqs = 1024;
+module_param_named(max_srqs, hfi1_max_srqs, uint, S_IRUGO);
+MODULE_PARM_DESC(max_srqs, "Maximum number of SRQs to support");
+
+unsigned int hfi1_max_srq_sges = 128;
+module_param_named(max_srq_sges, hfi1_max_srq_sges, uint, S_IRUGO);
+MODULE_PARM_DESC(max_srq_sges, "Maximum number of SRQ SGEs to support");
+
+unsigned int hfi1_max_srq_wrs = 0x1FFFF;
+module_param_named(max_srq_wrs, hfi1_max_srq_wrs, uint, S_IRUGO);
+MODULE_PARM_DESC(max_srq_wrs, "Maximum number of SRQ WRs to support");
+
+static void verbs_sdma_complete(
+	struct sdma_txreq *cookie,
+	int status,
+	int drained);
+
+/*
+ * Note that it is OK to post send work requests in the SQE and ERR
+ * states; hfi1_do_send() will process them and generate error
+ * completions as per IB 1.2 C10-96.
+ */
+const int ib_hfi1_state_ops[IB_QPS_ERR + 1] = {
+	[IB_QPS_RESET] = 0,
+	[IB_QPS_INIT] = HFI1_POST_RECV_OK,
+	[IB_QPS_RTR] = HFI1_POST_RECV_OK | HFI1_PROCESS_RECV_OK,
+	[IB_QPS_RTS] = HFI1_POST_RECV_OK | HFI1_PROCESS_RECV_OK |
+	    HFI1_POST_SEND_OK | HFI1_PROCESS_SEND_OK |
+	    HFI1_PROCESS_NEXT_SEND_OK,
+	[IB_QPS_SQD] = HFI1_POST_RECV_OK | HFI1_PROCESS_RECV_OK |
+	    HFI1_POST_SEND_OK | HFI1_PROCESS_SEND_OK,
+	[IB_QPS_SQE] = HFI1_POST_RECV_OK | HFI1_PROCESS_RECV_OK |
+	    HFI1_POST_SEND_OK | HFI1_FLUSH_SEND,
+	[IB_QPS_ERR] = HFI1_POST_RECV_OK | HFI1_FLUSH_RECV |
+	    HFI1_POST_SEND_OK | HFI1_FLUSH_SEND,
+};
+
+struct hfi1_ucontext {
+	struct ib_ucontext ibucontext;
+};
+
+static inline struct hfi1_ucontext *to_iucontext(struct ib_ucontext
+						  *ibucontext)
+{
+	return container_of(ibucontext, struct hfi1_ucontext, ibucontext);
+}
+
+/*
+ * Translate ib_wr_opcode into ib_wc_opcode.
+ */
+const enum ib_wc_opcode ib_hfi1_wc_opcode[] = {
+	[IB_WR_RDMA_WRITE] = IB_WC_RDMA_WRITE,
+	[IB_WR_RDMA_WRITE_WITH_IMM] = IB_WC_RDMA_WRITE,
+	[IB_WR_SEND] = IB_WC_SEND,
+	[IB_WR_SEND_WITH_IMM] = IB_WC_SEND,
+	[IB_WR_RDMA_READ] = IB_WC_RDMA_READ,
+	[IB_WR_ATOMIC_CMP_AND_SWP] = IB_WC_COMP_SWAP,
+	[IB_WR_ATOMIC_FETCH_AND_ADD] = IB_WC_FETCH_ADD
+};
+
+/*
+ * Length of header by opcode, 0 --> not supported
+ */
+const u8 hdr_len_by_opcode[256] = {
+	/* RC */
+	[IB_OPCODE_RC_SEND_FIRST] = 12 + 8,
+	[IB_OPCODE_RC_SEND_MIDDLE] = 12 + 8,
+	[IB_OPCODE_RC_SEND_LAST] = 12 + 8,
+	[IB_OPCODE_RC_SEND_LAST_WITH_IMMEDIATE] = 12 + 8 + 4,
+	[IB_OPCODE_RC_SEND_ONLY] = 12 + 8,
+	[IB_OPCODE_RC_SEND_ONLY_WITH_IMMEDIATE] = 12 + 8 + 4,
+	[IB_OPCODE_RC_RDMA_WRITE_FIRST] = 12 + 8 + 16,
+	[IB_OPCODE_RC_RDMA_WRITE_MIDDLE] = 12 + 8,
+	[IB_OPCODE_RC_RDMA_WRITE_LAST] = 12 + 8,
+	[IB_OPCODE_RC_RDMA_WRITE_LAST_WITH_IMMEDIATE] = 12 + 8 + 4,
+	[IB_OPCODE_RC_RDMA_WRITE_ONLY] = 12 + 8 + 16,
+	[IB_OPCODE_RC_RDMA_WRITE_ONLY_WITH_IMMEDIATE] = 12 + 8 + 20,
+	[IB_OPCODE_RC_RDMA_READ_REQUEST] = 12 + 8 + 16,
+	[IB_OPCODE_RC_RDMA_READ_RESPONSE_FIRST] = 12 + 8 + 4,
+	[IB_OPCODE_RC_RDMA_READ_RESPONSE_MIDDLE] = 12 + 8,
+	[IB_OPCODE_RC_RDMA_READ_RESPONSE_LAST] = 12 + 8 + 4,
+	[IB_OPCODE_RC_RDMA_READ_RESPONSE_ONLY] = 12 + 8 + 4,
+	[IB_OPCODE_RC_ACKNOWLEDGE] = 12 + 8 + 4,
+	[IB_OPCODE_RC_ATOMIC_ACKNOWLEDGE] = 12 + 8 + 4,
+	[IB_OPCODE_RC_COMPARE_SWAP] = 12 + 8 + 28,
+	[IB_OPCODE_RC_FETCH_ADD] = 12 + 8 + 28,
+	/* UC */
+	[IB_OPCODE_UC_SEND_FIRST] = 12 + 8,
+	[IB_OPCODE_UC_SEND_MIDDLE] = 12 + 8,
+	[IB_OPCODE_UC_SEND_LAST] = 12 + 8,
+	[IB_OPCODE_UC_SEND_LAST_WITH_IMMEDIATE] = 12 + 8 + 4,
+	[IB_OPCODE_UC_SEND_ONLY] = 12 + 8,
+	[IB_OPCODE_UC_SEND_ONLY_WITH_IMMEDIATE] = 12 + 8 + 4,
+	[IB_OPCODE_UC_RDMA_WRITE_FIRST] = 12 + 8 + 16,
+	[IB_OPCODE_UC_RDMA_WRITE_MIDDLE] = 12 + 8,
+	[IB_OPCODE_UC_RDMA_WRITE_LAST] = 12 + 8,
+	[IB_OPCODE_UC_RDMA_WRITE_LAST_WITH_IMMEDIATE] = 12 + 8 + 4,
+	[IB_OPCODE_UC_RDMA_WRITE_ONLY] = 12 + 8 + 16,
+	[IB_OPCODE_UC_RDMA_WRITE_ONLY_WITH_IMMEDIATE] = 12 + 8 + 20,
+	/* UD */
+	[IB_OPCODE_UD_SEND_ONLY] = 12 + 8 + 8,
+	[IB_OPCODE_UD_SEND_ONLY_WITH_IMMEDIATE] = 12 + 8 + 12
+};
+
+/*
+ * System image GUID.
+ */
+__be64 ib_hfi1_sys_image_guid;
+
+/**
+ * hfi1_copy_sge - copy data to SGE memory
+ * @ss: the SGE state
+ * @data: the data to copy
+ * @length: the length of the data
+ */
+void hfi1_copy_sge(
+	struct hfi1_sge_state *ss,
+	void *data, u32 length,
+	int release)
+{
+	struct hfi1_sge *sge = &ss->sge;
+
+	while (length) {
+		u32 len = sge->length;
+
+		if (len > length)
+			len = length;
+		if (len > sge->sge_length)
+			len = sge->sge_length;
+		BUG_ON(len == 0);
+		memcpy(sge->vaddr, data, len);
+		sge->vaddr += len;
+		sge->length -= len;
+		sge->sge_length -= len;
+		if (sge->sge_length == 0) {
+			if (release)
+				hfi1_put_mr(sge->mr);
+			if (--ss->num_sge)
+				*sge = *ss->sg_list++;
+		} else if (sge->length == 0 && sge->mr->lkey) {
+			if (++sge->n >= HFI1_SEGSZ) {
+				if (++sge->m >= sge->mr->mapsz)
+					break;
+				sge->n = 0;
+			}
+			sge->vaddr =
+				sge->mr->map[sge->m]->segs[sge->n].vaddr;
+			sge->length =
+				sge->mr->map[sge->m]->segs[sge->n].length;
+		}
+		data += len;
+		length -= len;
+	}
+}
+
+/**
+ * hfi1_skip_sge - skip over SGE memory
+ * @ss: the SGE state
+ * @length: the number of bytes to skip
+ */
+void hfi1_skip_sge(struct hfi1_sge_state *ss, u32 length, int release)
+{
+	struct hfi1_sge *sge = &ss->sge;
+
+	while (length) {
+		u32 len = sge->length;
+
+		if (len > length)
+			len = length;
+		if (len > sge->sge_length)
+			len = sge->sge_length;
+		BUG_ON(len == 0);
+		sge->vaddr += len;
+		sge->length -= len;
+		sge->sge_length -= len;
+		if (sge->sge_length == 0) {
+			if (release)
+				hfi1_put_mr(sge->mr);
+			if (--ss->num_sge)
+				*sge = *ss->sg_list++;
+		} else if (sge->length == 0 && sge->mr->lkey) {
+			if (++sge->n >= HFI1_SEGSZ) {
+				if (++sge->m >= sge->mr->mapsz)
+					break;
+				sge->n = 0;
+			}
+			sge->vaddr =
+				sge->mr->map[sge->m]->segs[sge->n].vaddr;
+			sge->length =
+				sge->mr->map[sge->m]->segs[sge->n].length;
+		}
+		length -= len;
+	}
+}
+
+/**
+ * post_one_send - post one RC, UC, or UD send work request
+ * @qp: the QP to post on
+ * @wr: the work request to send
+ */
+static int post_one_send(struct hfi1_qp *qp, struct ib_send_wr *wr,
+			 int *scheduled)
+{
+	struct hfi1_swqe *wqe;
+	u32 next;
+	int i;
+	int j;
+	int acc;
+	int ret;
+	unsigned long flags;
+	struct hfi1_lkey_table *rkt;
+	struct hfi1_pd *pd;
+	u8 sc5;
+	struct hfi1_devdata *dd = dd_from_ibdev(qp->ibqp.device);
+	struct hfi1_pportdata *ppd;
+	struct hfi1_ibport *ibp;
+
+	spin_lock_irqsave(&qp->s_lock, flags);
+	ppd = &dd->pport[qp->port_num - 1];
+	ibp = &ppd->ibport_data;
+
+	/* Check that state is OK to post send. */
+	if (unlikely(!(ib_hfi1_state_ops[qp->state] & HFI1_POST_SEND_OK)))
+		goto bail_inval;
+
+	/* IB spec says that num_sge == 0 is OK. */
+	if (wr->num_sge > qp->s_max_sge)
+		goto bail_inval;
+
+	/*
+	 * Don't allow RDMA reads or atomic operations on UC or
+	 * undefined operations.
+	 * Make sure buffer is large enough to hold the result for atomics.
+	 */
+	if (wr->opcode == IB_WR_FAST_REG_MR) {
+		if (hfi1_fast_reg_mr(qp, wr))
+			goto bail_inval;
+	} else if (qp->ibqp.qp_type == IB_QPT_UC) {
+		if ((unsigned) wr->opcode >= IB_WR_RDMA_READ)
+			goto bail_inval;
+	} else if (qp->ibqp.qp_type != IB_QPT_RC) {
+		/* Check IB_QPT_SMI, IB_QPT_GSI, IB_QPT_UD opcode */
+		if (wr->opcode != IB_WR_SEND &&
+		    wr->opcode != IB_WR_SEND_WITH_IMM)
+			goto bail_inval;
+		/* Check UD destination address PD */
+		if (qp->ibqp.pd != wr->wr.ud.ah->pd)
+			goto bail_inval;
+	} else if ((unsigned) wr->opcode > IB_WR_ATOMIC_FETCH_AND_ADD)
+		goto bail_inval;
+	else if (wr->opcode >= IB_WR_ATOMIC_CMP_AND_SWP &&
+		   (wr->num_sge == 0 ||
+		    wr->sg_list[0].length < sizeof(u64) ||
+		    wr->sg_list[0].addr & (sizeof(u64) - 1)))
+		goto bail_inval;
+	else if (wr->opcode >= IB_WR_RDMA_READ && !qp->s_max_rd_atomic)
+		goto bail_inval;
+
+	next = qp->s_head + 1;
+	if (next >= qp->s_size)
+		next = 0;
+	if (next == qp->s_last) {
+		ret = -ENOMEM;
+		goto bail;
+	}
+
+	rkt = &to_idev(qp->ibqp.device)->lk_table;
+	pd = to_ipd(qp->ibqp.pd);
+	wqe = get_swqe_ptr(qp, qp->s_head);
+	wqe->wr = *wr;
+	wqe->length = 0;
+	j = 0;
+	if (wr->num_sge) {
+		acc = wr->opcode >= IB_WR_RDMA_READ ?
+			IB_ACCESS_LOCAL_WRITE : 0;
+		for (i = 0; i < wr->num_sge; i++) {
+			u32 length = wr->sg_list[i].length;
+			int ok;
+
+			if (length == 0)
+				continue;
+			ok = hfi1_lkey_ok(rkt, pd, &wqe->sg_list[j],
+					  &wr->sg_list[i], acc);
+			if (!ok)
+				goto bail_inval_free;
+			wqe->length += length;
+			j++;
+		}
+		wqe->wr.num_sge = j;
+	}
+	if (qp->ibqp.qp_type == IB_QPT_UC ||
+	    qp->ibqp.qp_type == IB_QPT_RC) {
+		if (wqe->length > 0x80000000U)
+			goto bail_inval_free;
+		sc5 = ibp->sl_to_sc[qp->remote_ah_attr.sl];
+	} else {
+		struct hfi1_ah *ah = to_iah(wr->wr.ud.ah);
+		u8 vl;
+
+		sc5 = ibp->sl_to_sc[ah->attr.sl];
+		vl = sc_to_vlt(dd, sc5);
+		if (vl < PER_VL_SEND_CONTEXTS)
+			if (wqe->length > dd->vld[vl].mtu)
+				goto bail_inval_free;
+
+		atomic_inc(&ah->refcount);
+	}
+	wqe->ssn = qp->s_ssn++;
+	qp->s_head = next;
+
+	ret = 0;
+	goto bail;
+
+bail_inval_free:
+	while (j) {
+		struct hfi1_sge *sge = &wqe->sg_list[--j];
+
+		hfi1_put_mr(sge->mr);
+	}
+bail_inval:
+	ret = -EINVAL;
+bail:
+	if (!ret && !wr->next) {
+		struct sdma_engine *sde;
+
+		sde = qp_to_sdma_engine(qp, sc5);
+		if (sde && !sdma_empty(sde)) {
+			hfi1_schedule_send(qp);
+			*scheduled = 1;
+		}
+	}
+	spin_unlock_irqrestore(&qp->s_lock, flags);
+	return ret;
+}
+
+/**
+ * post_send - post a send on a QP
+ * @ibqp: the QP to post the send on
+ * @wr: the list of work requests to post
+ * @bad_wr: the first bad WR is put here
+ *
+ * This may be called from interrupt context.
+ */
+static int post_send(struct ib_qp *ibqp, struct ib_send_wr *wr,
+		     struct ib_send_wr **bad_wr)
+{
+	struct hfi1_qp *qp = to_iqp(ibqp);
+	int err = 0;
+	int scheduled = 0;
+
+	for (; wr; wr = wr->next) {
+		err = post_one_send(qp, wr, &scheduled);
+		if (err) {
+			*bad_wr = wr;
+			goto bail;
+		}
+	}
+
+	/* Try to do the send work in the caller's context. */
+	if (!scheduled)
+		hfi1_do_send(&qp->s_iowait.iowork);
+
+bail:
+	return err;
+}
+
+/**
+ * post_receive - post a receive on a QP
+ * @ibqp: the QP to post the receive on
+ * @wr: the WR to post
+ * @bad_wr: the first bad WR is put here
+ *
+ * This may be called from interrupt context.
+ */
+static int post_receive(struct ib_qp *ibqp, struct ib_recv_wr *wr,
+			struct ib_recv_wr **bad_wr)
+{
+	struct hfi1_qp *qp = to_iqp(ibqp);
+	struct hfi1_rwq *wq = qp->r_rq.wq;
+	unsigned long flags;
+	int ret;
+
+	/* Check that state is OK to post receive. */
+	if (!(ib_hfi1_state_ops[qp->state] & HFI1_POST_RECV_OK) || !wq) {
+		*bad_wr = wr;
+		ret = -EINVAL;
+		goto bail;
+	}
+
+	for (; wr; wr = wr->next) {
+		struct hfi1_rwqe *wqe;
+		u32 next;
+		int i;
+
+		if ((unsigned) wr->num_sge > qp->r_rq.max_sge) {
+			*bad_wr = wr;
+			ret = -EINVAL;
+			goto bail;
+		}
+
+		spin_lock_irqsave(&qp->r_rq.lock, flags);
+		next = wq->head + 1;
+		if (next >= qp->r_rq.size)
+			next = 0;
+		if (next == wq->tail) {
+			spin_unlock_irqrestore(&qp->r_rq.lock, flags);
+			*bad_wr = wr;
+			ret = -ENOMEM;
+			goto bail;
+		}
+
+		wqe = get_rwqe_ptr(&qp->r_rq, wq->head);
+		wqe->wr_id = wr->wr_id;
+		wqe->num_sge = wr->num_sge;
+		for (i = 0; i < wr->num_sge; i++)
+			wqe->sg_list[i] = wr->sg_list[i];
+		/* Make sure queue entry is written before the head index. */
+		smp_wmb();
+		wq->head = next;
+		spin_unlock_irqrestore(&qp->r_rq.lock, flags);
+	}
+	ret = 0;
+
+bail:
+	return ret;
+}
+
+/**
+ * qp_rcv - processing an incoming packet on a QP
+ * @rcd: the context pointer
+ * @hdr: the packet header
+ * @rcv_flags: flags relevant to rcv processing
+ * @data: the packet data
+ * @tlen: the packet length
+ * @qp: the QP the packet came on
+ *
+ * This is called from hfi1_ib_rcv() to process an incoming packet
+ * for the given QP.
+ * Called at interrupt level.
+ */
+static void qp_rcv(struct hfi1_ctxtdata *rcd, struct hfi1_ib_header *hdr,
+		   u32 rcv_flags, void *data, u32 tlen, struct hfi1_qp *qp)
+{
+	struct hfi1_ibport *ibp = &rcd->ppd->ibport_data;
+
+	spin_lock(&qp->r_lock);
+
+	/* Check for valid receive state. */
+	if (!(ib_hfi1_state_ops[qp->state] & HFI1_PROCESS_RECV_OK)) {
+		ibp->n_pkt_drops++;
+		goto unlock;
+	}
+
+	switch (qp->ibqp.qp_type) {
+	case IB_QPT_SMI:
+	case IB_QPT_GSI:
+		if (!HFI1_CAP_IS_KSET(ENABLE_SMA))
+			break;
+		/* FALLTHROUGH */
+	case IB_QPT_UD:
+		hfi1_ud_rcv(ibp, hdr, rcv_flags, data, tlen, qp);
+		break;
+
+	case IB_QPT_RC:
+		hfi1_rc_rcv(rcd, hdr, rcv_flags, data, tlen, qp);
+		break;
+
+	case IB_QPT_UC:
+		hfi1_uc_rcv(ibp, hdr, rcv_flags, data, tlen, qp);
+		break;
+
+	default:
+		break;
+	}
+
+unlock:
+	spin_unlock(&qp->r_lock);
+}
+
+/**
+ * hfi1_ib_rcv - process an incoming packet
+ * @packet: data packet information
+ *
+ * This is called to process an incoming packet at interrupt level.
+ *
+ * Tlen is the length of the header + data + CRC in bytes.
+ */
+void hfi1_ib_rcv(struct hfi1_packet *packet)
+{
+	struct hfi1_ctxtdata *rcd = packet->rcd;
+	struct hfi1_ib_header *hdr = packet->hdr;
+	void *data = packet->ebuf;
+	u32 tlen = packet->tlen;
+	struct hfi1_pportdata *ppd = rcd->ppd;
+	struct hfi1_ibport *ibp = &ppd->ibport_data;
+	struct hfi1_other_headers *ohdr;
+	struct hfi1_qp *qp;
+	u32 qp_num;
+	u32 rcv_flags = 0;
+	int lnh;
+	u8 opcode;
+	u16 lid;
+
+	/* 24 == LRH+BTH+CRC */
+	if (unlikely(tlen < 24))
+		goto drop;
+
+	/* Check for a valid destination LID (see ch. 7.11.1). */
+	lid = be16_to_cpu(hdr->lrh[1]);
+
+	/* Check for GRH */
+	lnh = be16_to_cpu(hdr->lrh[0]) & 3;
+	if (lnh == HFI1_LRH_BTH)
+		ohdr = &hdr->u.oth;
+	else if (lnh == HFI1_LRH_GRH) {
+		u32 vtf;
+
+		ohdr = &hdr->u.l.oth;
+		if (hdr->u.l.grh.next_hdr != IB_GRH_NEXT_HDR)
+			goto drop;
+		vtf = be32_to_cpu(hdr->u.l.grh.version_tclass_flow);
+		if ((vtf >> IB_GRH_VERSION_SHIFT) != IB_GRH_VERSION)
+			goto drop;
+	} else
+		goto drop;
+
+	trace_input_ibhdr(rcd->dd, hdr);
+
+	opcode = (be32_to_cpu(ohdr->bth[0]) >> 24) & 0x7f;
+	inc_opstats(tlen, &rcd->opstats->stats[opcode]);
+
+	/* Get the destination QP number. */
+	qp_num = be32_to_cpu(ohdr->bth[1]) & HFI1_QPN_MASK;
+	if ((lid >= HFI1_MULTICAST_LID_BASE) &&
+	    (lid != HFI1_PERMISSIVE_LID)) {
+		struct hfi1_mcast *mcast;
+		struct hfi1_mcast_qp *p;
+
+		if (lnh != HFI1_LRH_GRH)
+			goto drop;
+		mcast = hfi1_mcast_find(ibp, &hdr->u.l.grh.dgid);
+		if (mcast == NULL)
+			goto drop;
+		rcv_flags |= HFI1_HAS_GRH;
+		if (rhf_dc_info(packet->rhf))
+			rcv_flags |= HFI1_SC4_BIT;
+		list_for_each_entry_rcu(p, &mcast->qp_list, list)
+			qp_rcv(rcd, hdr, rcv_flags, data, tlen, p->qp);
+		/*
+		 * Notify hfi1_multicast_detach() if it is waiting for us
+		 * to finish.
+		 */
+		if (atomic_dec_return(&mcast->refcount) <= 1)
+			wake_up(&mcast->wait);
+	} else {
+		if (rcd->lookaside_qp) {
+			if (rcd->lookaside_qpn != qp_num) {
+				if (atomic_dec_and_test(
+					&rcd->lookaside_qp->refcount))
+					wake_up(&rcd->lookaside_qp->wait);
+				rcd->lookaside_qp = NULL;
+			}
+		}
+		if (!rcd->lookaside_qp) {
+			qp = hfi1_lookup_qpn(ibp, qp_num);
+			if (!qp)
+				goto drop;
+			rcd->lookaside_qp = qp;
+			rcd->lookaside_qpn = qp_num;
+		} else
+			qp = rcd->lookaside_qp;
+
+		if (lnh == HFI1_LRH_GRH)
+			rcv_flags |= HFI1_HAS_GRH;
+		if (rhf_dc_info(packet->rhf))
+			rcv_flags |= HFI1_SC4_BIT;
+		qp_rcv(rcd, hdr, rcv_flags, data, tlen, qp);
+	}
+	return;
+
+drop:
+	ibp->n_pkt_drops++;
+}
+
+/*
+ * This is called from a timer to check for QPs
+ * which need kernel memory in order to send a packet.
+ */
+static void mem_timer(unsigned long data)
+{
+	struct hfi1_ibdev *dev = (struct hfi1_ibdev *)data;
+	struct list_head *list = &dev->memwait;
+	struct hfi1_qp *qp = NULL;
+	struct iowait *wait;
+	unsigned long flags;
+
+	spin_lock_irqsave(&dev->pending_lock, flags);
+	if (!list_empty(list)) {
+		wait = list_first_entry(list, struct iowait, list);
+		qp = container_of(wait, struct hfi1_qp, s_iowait);
+		list_del_init(&qp->s_iowait.list);
+		/* refcount held until actual wake up */
+		if (!list_empty(list))
+			mod_timer(&dev->mem_timer, jiffies + 1);
+	}
+	spin_unlock_irqrestore(&dev->pending_lock, flags);
+
+	if (qp)
+		hfi1_qp_wakeup(qp, HFI1_S_WAIT_KMEM);
+}
+
+void update_sge(struct hfi1_sge_state *ss, u32 length)
+{
+	struct hfi1_sge *sge = &ss->sge;
+
+	sge->vaddr += length;
+	sge->length -= length;
+	sge->sge_length -= length;
+	if (sge->sge_length == 0) {
+		if (--ss->num_sge)
+			*sge = *ss->sg_list++;
+	} else if (sge->length == 0 && sge->mr->lkey) {
+		if (++sge->n >= HFI1_SEGSZ) {
+			if (++sge->m >= sge->mr->mapsz)
+				return;
+			sge->n = 0;
+		}
+		sge->vaddr = sge->mr->map[sge->m]->segs[sge->n].vaddr;
+		sge->length = sge->mr->map[sge->m]->segs[sge->n].length;
+	}
+}
+
+static noinline struct verbs_txreq *__get_txreq(struct hfi1_ibdev *dev,
+						struct hfi1_qp *qp)
+{
+	struct verbs_txreq *tx;
+	unsigned long flags;
+
+	spin_lock_irqsave(&qp->s_lock, flags);
+	spin_lock(&dev->pending_lock);
+
+	if (!list_empty(&dev->txreq_free)) {
+		struct list_head *l = dev->txreq_free.next;
+
+		list_del(l);
+		spin_unlock(&dev->pending_lock);
+		spin_unlock_irqrestore(&qp->s_lock, flags);
+		tx = list_entry(l, struct verbs_txreq, txreq.list);
+		tx->qp = qp;
+		atomic_inc(&qp->refcount);
+	} else {
+		if (ib_hfi1_state_ops[qp->state] & HFI1_PROCESS_RECV_OK &&
+		    list_empty(&qp->s_iowait.list)) {
+			dev->n_txwait++;
+			qp->s_flags |= HFI1_S_WAIT_TX;
+			list_add_tail(&qp->s_iowait.list, &dev->txwait);
+			trace_hfi1_qpsleep(qp, HFI1_S_WAIT_TX);
+			atomic_inc(&qp->refcount);
+		}
+		qp->s_flags &= ~HFI1_S_BUSY;
+		spin_unlock(&dev->pending_lock);
+		spin_unlock_irqrestore(&qp->s_lock, flags);
+		tx = ERR_PTR(-EBUSY);
+	}
+	return tx;
+}
+
+static inline struct verbs_txreq *get_txreq(struct hfi1_ibdev *dev,
+					    struct hfi1_qp *qp)
+{
+	struct verbs_txreq *tx;
+	unsigned long flags;
+
+	spin_lock_irqsave(&dev->pending_lock, flags);
+	/* assume the list non empty */
+	if (likely(!list_empty(&dev->txreq_free))) {
+		struct list_head *l = dev->txreq_free.next;
+
+		list_del(l);
+		spin_unlock_irqrestore(&dev->pending_lock, flags);
+		tx = list_entry(l, struct verbs_txreq, txreq.list);
+		tx->qp = qp;
+		atomic_inc(&qp->refcount);
+	} else {
+		/* call slow path to get the extra lock */
+		spin_unlock_irqrestore(&dev->pending_lock, flags);
+		tx = __get_txreq(dev, qp);
+	}
+	return tx;
+}
+
+void hfi1_put_txreq(struct verbs_txreq *tx)
+{
+	struct hfi1_ibdev *dev;
+	struct hfi1_qp *qp;
+	unsigned long flags;
+
+	qp = tx->qp;
+	dev = to_idev(qp->ibqp.device);
+
+	if (atomic_dec_and_test(&qp->refcount))
+		wake_up(&qp->wait);
+	if (tx->mr) {
+		hfi1_put_mr(tx->mr);
+		tx->mr = NULL;
+	}
+	sdma_txclean(dd_from_dev(dev), &tx->txreq);
+
+	spin_lock_irqsave(&dev->pending_lock, flags);
+
+	/* Put struct back on free list */
+	list_add(&tx->txreq.list, &dev->txreq_free);
+
+	if (!list_empty(&dev->txwait)) {
+		struct iowait *wait;
+
+		/* Wake up first QP wanting a free struct */
+		wait = list_first_entry(&dev->txwait, struct iowait, list);
+		qp = container_of(wait, struct hfi1_qp, s_iowait);
+		list_del_init(&qp->s_iowait.list);
+		/* refcount held until actual wake up */
+		spin_unlock_irqrestore(&dev->pending_lock, flags);
+		hfi1_qp_wakeup(qp, HFI1_S_WAIT_TX);
+	} else
+		spin_unlock_irqrestore(&dev->pending_lock, flags);
+}
+
+/*
+ * This is called with progress side lock held.
+ */
+/* New API */
+static void verbs_sdma_complete(
+	struct sdma_txreq *cookie,
+	int status,
+	int drained)
+{
+	struct verbs_txreq *tx =
+		container_of(cookie, struct verbs_txreq, txreq);
+	struct hfi1_qp *qp = tx->qp;
+
+	spin_lock(&qp->s_lock);
+	if (tx->wqe)
+		hfi1_send_complete(qp, tx->wqe, IB_WC_SUCCESS);
+	else if (qp->ibqp.qp_type == IB_QPT_RC) {
+		struct hfi1_ib_header *hdr;
+		struct hfi1_ibdev *dev = to_idev(qp->ibqp.device);
+
+		hdr = &dev->pio_hdrs[tx->hdr_inx].phdr.hdr;
+		hfi1_rc_send_complete(qp, hdr);
+	}
+	if (drained) {
+		/*
+		 * This happens when the send engine notes
+		 * a QP in the error state and cannot
+		 * do the flush work until that QP's
+		 * sdma work has finished.
+		 */
+		if (qp->s_flags & HFI1_S_WAIT_DMA) {
+			qp->s_flags &= ~HFI1_S_WAIT_DMA;
+			hfi1_schedule_send(qp);
+		}
+	}
+	spin_unlock(&qp->s_lock);
+
+	hfi1_put_txreq(tx);
+}
+
+static int wait_kmem(struct hfi1_ibdev *dev, struct hfi1_qp *qp)
+{
+	unsigned long flags;
+	int ret = 0;
+
+	spin_lock_irqsave(&qp->s_lock, flags);
+	if (ib_hfi1_state_ops[qp->state] & HFI1_PROCESS_RECV_OK) {
+		spin_lock(&dev->pending_lock);
+		if (list_empty(&qp->s_iowait.list)) {
+			if (list_empty(&dev->memwait))
+				mod_timer(&dev->mem_timer, jiffies + 1);
+			qp->s_flags |= HFI1_S_WAIT_KMEM;
+			list_add_tail(&qp->s_iowait.list, &dev->memwait);
+			trace_hfi1_qpsleep(qp, HFI1_S_WAIT_KMEM);
+			atomic_inc(&qp->refcount);
+		}
+		spin_unlock(&dev->pending_lock);
+		qp->s_flags &= ~HFI1_S_BUSY;
+		ret = -EBUSY;
+	}
+	spin_unlock_irqrestore(&qp->s_lock, flags);
+
+	return ret;
+}
+
+/*
+ * This routine calls txadds for each sg entry.
+ *
+ * Add failures will revert the sge cursor
+ */
+static int build_verbs_ulp_payload(
+	struct sdma_engine *sde,
+	struct hfi1_sge_state *ss,
+	u32 length,
+	struct verbs_txreq *tx)
+{
+	struct hfi1_sge *sg_list = ss->sg_list;
+	struct hfi1_sge sge = ss->sge;
+	u8 num_sge = ss->num_sge;
+	u32 len;
+	int ret = 0;
+
+	while (length) {
+		len = ss->sge.length;
+		if (len > length)
+			len = length;
+		if (len > ss->sge.sge_length)
+			len = ss->sge.sge_length;
+		BUG_ON(len == 0);
+		ret = sdma_txadd_kvaddr(
+			sde->dd,
+			&tx->txreq,
+			ss->sge.vaddr,
+			len);
+		if (ret)
+			goto bail_txadd;
+		update_sge(ss, len);
+		length -= len;
+	}
+	return ret;
+bail_txadd:
+	/* unwind cursor */
+	ss->sge = sge;
+	ss->num_sge = num_sge;
+	ss->sg_list = sg_list;
+	return ret;
+}
+
+/*
+ * Build the number of DMA descriptors needed to send length bytes of data.
+ *
+ * NOTE: DMA mapping is held in the tx until completed in the ring or
+ * the tx desc is freed without having been submitted to the ring
+ *
+ * This routine ensures that all of the following helper routine
+ * calls succeed.
+ */
+/* New API */
+static int build_verbs_tx_desc(
+	struct sdma_engine *sde,
+	struct hfi1_sge_state *ss,
+	u32 length,
+	struct verbs_txreq *tx,
+	struct ahg_ib_header *ahdr,
+	u64 pbc)
+{
+	struct hfi1_ibdev *dev = to_idev(tx->qp->ibqp.device);
+	int ret = 0;
+	struct hfi1_pio_header *phdr;
+	u16 hdrbytes = tx->hdr_dwords << 2;
+
+
+	phdr = &dev->pio_hdrs[tx->hdr_inx].phdr;
+	if (!ahdr->ahgcount) {
+		ret = sdma_txinit_ahg(
+			&tx->txreq,
+			ahdr->tx_flags,
+			hdrbytes + length,
+			ahdr->ahgidx,
+			0,
+			NULL,
+			0,
+			verbs_sdma_complete);
+		if (ret)
+			goto bail_txadd;
+		phdr->pbc = cpu_to_le64(pbc);
+		memcpy(&phdr->hdr, &ahdr->ibh, hdrbytes - sizeof(phdr->pbc));
+		/* add the header */
+		ret = sdma_txadd_daddr(
+			sde->dd,
+			&tx->txreq,
+			dev->pio_hdrs_phys + tx->hdr_inx *
+			sizeof(struct tx_pio_header),
+			tx->hdr_dwords << 2);
+		if (ret)
+			goto bail_txadd;
+	} else {
+		struct hfi1_other_headers *sohdr = &ahdr->ibh.u.oth;
+		struct hfi1_other_headers *dohdr = &phdr->hdr.u.oth;
+
+		/* needed in rc_send_complete() */
+		phdr->hdr.lrh[0] = ahdr->ibh.lrh[0];
+		if ((be16_to_cpu(phdr->hdr.lrh[0]) & 3) == HFI1_LRH_GRH) {
+			sohdr = &ahdr->ibh.u.l.oth;
+			dohdr = &phdr->hdr.u.l.oth;
+		}
+		/* opcode */
+		dohdr->bth[0] = sohdr->bth[0];
+		/* PSN/ACK */
+		dohdr->bth[2] = sohdr->bth[2];
+		ret = sdma_txinit_ahg(
+			&tx->txreq,
+			ahdr->tx_flags,
+			length,
+			ahdr->ahgidx,
+			ahdr->ahgcount,
+			ahdr->ahgdesc,
+			hdrbytes,
+			verbs_sdma_complete);
+		if (ret)
+			goto bail_txadd;
+	}
+
+	/* add the ulp payload - if any. ss can be NULL for acks */
+	if (ss)
+		ret = build_verbs_ulp_payload(sde, ss, length, tx);
+bail_txadd:
+	return ret;
+}
+
+int hfi1_verbs_send_dma(struct hfi1_qp *qp, struct ahg_ib_header *ahdr,
+			u32 hdrwords, struct hfi1_sge_state *ss, u32 len,
+			u32 plen, u32 dwords, u64 pbc)
+{
+	struct hfi1_ibdev *dev = to_idev(qp->ibqp.device);
+	struct hfi1_ibport *ibp = to_iport(qp->ibqp.device, qp->port_num);
+	struct hfi1_pportdata *ppd = ppd_from_ibp(ibp);
+	struct verbs_txreq *tx;
+	struct sdma_txreq *stx;
+	u64 pbc_flags = 0;
+	struct sdma_engine *sde;
+	u8 sc5 = qp->s_sc;
+	int ret;
+
+	if (!list_empty(&qp->s_iowait.tx_head)) {
+		stx = list_first_entry(
+			&qp->s_iowait.tx_head,
+			struct sdma_txreq,
+			list);
+		list_del_init(&stx->list);
+		tx = container_of(stx, struct verbs_txreq, txreq);
+		ret = sdma_send_txreq(tx->sde, &qp->s_iowait, stx);
+		if (unlikely(ret == -ECOMM))
+			goto bail_ecomm;
+		return ret;
+	}
+
+	tx = get_txreq(dev, qp);
+	if (IS_ERR(tx))
+		goto bail_tx;
+
+	if (!qp->s_hdr->sde)
+		tx->sde = sde = qp_to_sdma_engine(qp, sc5);
+	else
+		tx->sde = sde = qp->s_hdr->sde;
+
+	if (likely(pbc == 0)) {
+		u32 vl = sc_to_vlt(dd_from_ibdev(qp->ibqp.device), sc5);
+		/* No vl15 here */
+		/* set PBC_DC_INFO bit (aka SC[4]) in pbc_flags */
+		pbc_flags |= (!!(sc5 & 0x10)) << PBC_DC_INFO_SHIFT;
+
+		pbc = create_pbc(ppd, pbc_flags, qp->srate_mbps, vl, plen);
+	}
+	tx->wqe = qp->s_wqe;
+	tx->mr = qp->s_rdma_mr;
+	if (qp->s_rdma_mr)
+		qp->s_rdma_mr = NULL;
+	tx->hdr_dwords = hdrwords + 2;
+	ret = build_verbs_tx_desc(sde, ss, len, tx, ahdr, pbc);
+	if (unlikely(ret))
+		goto bail_build;
+	trace_output_ibhdr(dd_from_ibdev(qp->ibqp.device), &ahdr->ibh);
+	ret = sdma_send_txreq(sde, &qp->s_iowait, &tx->txreq);
+	if (unlikely(ret == -ECOMM))
+		goto bail_ecomm;
+	return ret;
+bail_ecomm:
+	/* The current one got "sent" */
+	return 0;
+bail_build:
+	/* kmalloc or mapping fail */
+	hfi1_put_txreq(tx);
+	return wait_kmem(dev, qp);
+bail_tx:
+	return PTR_ERR(tx);
+}
+
+/*
+ * If we are now in the error state, return zero to flush the
+ * send work request.
+ */
+static int no_bufs_available(struct hfi1_qp *qp, struct send_context *sc)
+{
+	struct hfi1_devdata *dd = sc->dd;
+	struct hfi1_ibdev *dev = &dd->verbs_dev;
+	unsigned long flags;
+	int ret = 0;
+
+	/*
+	 * Note that as soon as want_buffer() is called and
+	 * possibly before it returns, sc_piobufavail()
+	 * could be called. Therefore, put QP on the I/O wait list before
+	 * enabling the PIO avail interrupt.
+	 */
+	spin_lock_irqsave(&qp->s_lock, flags);
+	if (ib_hfi1_state_ops[qp->state] & HFI1_PROCESS_RECV_OK) {
+		spin_lock(&dev->pending_lock);
+		if (list_empty(&qp->s_iowait.list)) {
+			struct hfi1_ibdev *dev = &dd->verbs_dev;
+			int was_empty;
+
+			dev->n_piowait++;
+			qp->s_flags |= HFI1_S_WAIT_PIO;
+			was_empty = list_empty(&sc->piowait);
+			list_add_tail(&qp->s_iowait.list, &sc->piowait);
+			trace_hfi1_qpsleep(qp, HFI1_S_WAIT_PIO);
+			atomic_inc(&qp->refcount);
+			/* counting: only call wantpiobuf_intr if first user */
+			if (was_empty)
+				hfi1_sc_wantpiobuf_intr(sc, 1);
+		}
+		spin_unlock(&dev->pending_lock);
+		qp->s_flags &= ~HFI1_S_BUSY;
+		ret = -EBUSY;
+	}
+	spin_unlock_irqrestore(&qp->s_lock, flags);
+	return ret;
+}
+
+struct send_context *qp_to_send_context(struct hfi1_qp *qp, u8 sc5)
+{
+	struct hfi1_devdata *dd = dd_from_ibdev(qp->ibqp.device);
+	struct hfi1_pportdata *ppd = dd->pport + (qp->port_num - 1);
+	u8 vl;
+
+	vl = sc_to_vlt(dd, sc5);
+	if (vl >= hfi1_num_vls(ppd->vls_supported) && vl != 15)
+		return NULL;
+	return dd->vld[vl].sc;
+}
+
+int hfi1_verbs_send_pio(struct hfi1_qp *qp, struct ahg_ib_header *ahdr,
+			u32 hdrwords, struct hfi1_sge_state *ss, u32 len,
+			u32 plen, u32 dwords, u64 pbc)
+{
+	struct hfi1_ibport *ibp = to_iport(qp->ibqp.device, qp->port_num);
+	struct hfi1_pportdata *ppd = ppd_from_ibp(ibp);
+	u32 *hdr = (u32 *)&ahdr->ibh;
+	u64 pbc_flags = 0;
+	u32 sc5;
+	unsigned long flags = 0;
+	struct send_context *sc;
+	struct pio_buf *pbuf;
+	int wc_status = IB_WC_SUCCESS;
+ + /* vl15 special case taken care of in ud.c */ + sc5 = qp->s_sc; + sc = qp_to_send_context(qp, sc5); + + if (!sc) + return -EINVAL; + if (likely(pbc == 0)) { + u32 vl = sc_to_vlt(dd_from_ibdev(qp->ibqp.device), sc5); + /* set PBC_DC_INFO bit (aka SC[4]) in pbc_flags */ + pbc_flags |= (!!(sc5 & 0x10)) << PBC_DC_INFO_SHIFT; + pbc = create_pbc(ppd, pbc_flags, qp->srate_mbps, vl, plen); + } + pbuf = sc_buffer_alloc(sc, plen, NULL, NULL); + if (unlikely(pbuf == NULL)) { + if (ppd->host_link_state != HLS_UP_ACTIVE) { + /* + * If we have filled the PIO buffers to capacity and are + * not in an active state, this request is not going to + * go out, so just complete it with an error or else a + * ULP or the core may be stuck waiting. + */ + hfi1_cdbg( + PIO, + "alloc failed. state not active, completing"); + wc_status = IB_WC_GENERAL_ERR; + goto pio_bail; + } else { + /* + * This is a normal occurrence. The PIO buffs are full + * up but we are still happily sending, well we could be + * so let's continue to queue the request. + */ + hfi1_cdbg(PIO, "alloc failed. 
state active, queuing"); + return no_bufs_available(qp, sc); + } + } + + if (len == 0) { + pio_copy(ppd->dd, pbuf, pbc, hdr, hdrwords); + } else { + if (ss) { + seg_pio_copy_start(pbuf, pbc, hdr, hdrwords*4); + while (len) { + void *addr = ss->sge.vaddr; + u32 slen = ss->sge.length; + + if (slen > len) + slen = len; + update_sge(ss, slen); + seg_pio_copy_mid(pbuf, addr, slen); + len -= slen; + } + seg_pio_copy_end(pbuf); + } + } + + trace_output_ibhdr(dd_from_ibdev(qp->ibqp.device), &ahdr->ibh); + + if (qp->s_rdma_mr) { + hfi1_put_mr(qp->s_rdma_mr); + qp->s_rdma_mr = NULL; + } + +pio_bail: + if (qp->s_wqe) { + spin_lock_irqsave(&qp->s_lock, flags); + hfi1_send_complete(qp, qp->s_wqe, wc_status); + spin_unlock_irqrestore(&qp->s_lock, flags); + } else if (qp->ibqp.qp_type == IB_QPT_RC) { + spin_lock_irqsave(&qp->s_lock, flags); + hfi1_rc_send_complete(qp, &ahdr->ibh); + spin_unlock_irqrestore(&qp->s_lock, flags); + } + return 0; +} +/* + * egress_pkey_matches_entry - return 1 if the pkey matches ent (ent + * being an entry from the ingress partition key table), return 0 + * otherwise. Use the matching criteria for egress partition keys + * specified in the OPAv1 spec., section 9.11.7. + */ +static inline int egress_pkey_matches_entry(u16 pkey, u16 ent) +{ + u16 mkey = pkey & PKEY_LOW_15_MASK; + u16 ment = ent & PKEY_LOW_15_MASK; + + if (mkey == ment) { + /* + * If pkey[15] is set (full partition member), + * is bit 15 in the corresponding table element + * clear (limited member)? + */ + if (pkey & PKEY_MEMBER_MASK) + return !!(ent & PKEY_MEMBER_MASK); + return 1; + } + return 0; +} + +/* + * egress_pkey_check - return 0 if hdr's pkey matches according to the + * criteria in the OPAv1 spec., section 9.11.7. 
+ */ +static inline int egress_pkey_check(struct hfi1_pportdata *ppd, + struct hfi1_ib_header *hdr, + struct hfi1_qp *qp) +{ + struct hfi1_other_headers *ohdr; + struct hfi1_devdata *dd; + int i = 0; + u16 pkey; + u8 lnh, sc5 = qp->s_sc; + + if (!(ppd->part_enforce & HFI1_PART_ENFORCE_OUT)) + return 0; + + /* locate the pkey within the headers */ + lnh = be16_to_cpu(hdr->lrh[0]) & 3; + if (lnh == HFI1_LRH_GRH) + ohdr = &hdr->u.l.oth; + else + ohdr = &hdr->u.oth; + + pkey = (u16)be32_to_cpu(ohdr->bth[0]); + + /* If SC15, pkey[0:14] must be 0x7fff */ + if ((sc5 == 0xf) && ((pkey & PKEY_LOW_15_MASK) != PKEY_LOW_15_MASK)) + goto bad; + + + /* Is the pkey = 0x0, or 0x8000? */ + if ((pkey & PKEY_LOW_15_MASK) == 0) + goto bad; + + /* The most likely matching pkey has index qp->s_pkey_index */ + if (!egress_pkey_matches_entry(pkey, ppd->pkeys[qp->s_pkey_index])) { + /* no match - try the entire table */ + for (; i < MAX_PKEY_VALUES; i++) { + if (egress_pkey_matches_entry(pkey, ppd->pkeys[i])) + break; + } + } + + if (i < MAX_PKEY_VALUES) + return 0; +bad: + incr_cntr64(&ppd->port_xmit_constraint_errors); + dd = ppd->dd; + if (!(dd->err_info_xmit_constraint.status & OPA_EI_STATUS_SMASK)) { + u16 slid = be16_to_cpu(hdr->lrh[3]); + + dd->err_info_xmit_constraint.status |= OPA_EI_STATUS_SMASK; + dd->err_info_xmit_constraint.slid = slid; + dd->err_info_xmit_constraint.pkey = pkey; + } + return 1; +} + +/** + * hfi1_verbs_send - send a packet + * @qp: the QP to send on + * @ahdr: the packet header + * @hdrwords: the number of 32-bit words in the header + * @ss: the SGE to send + * @len: the length of the packet in bytes + * + * Return zero if packet is sent or queued OK. + * Return non-zero and clear qp->s_flags HFI1_S_BUSY otherwise. 
+ */ +int hfi1_verbs_send(struct hfi1_qp *qp, struct ahg_ib_header *ahdr, + u32 hdrwords, struct hfi1_sge_state *ss, u32 len) +{ + struct hfi1_devdata *dd = dd_from_ibdev(qp->ibqp.device); + u32 plen; + int ret; + int pio = 0; + unsigned long flags = 0; + u32 dwords = (len + 3) >> 2; + + /* + * VL15 packets (IB_QPT_SMI) will always use PIO, so we + * can defer SDMA restart until link goes ACTIVE without + * worrying about just how we got there. + */ + if ((qp->ibqp.qp_type == IB_QPT_SMI) || + !(dd->flags & HFI1_HAS_SEND_DMA)) + pio = 1; + + ret = egress_pkey_check(dd->pport, &ahdr->ibh, qp); + if (unlikely(ret)) { + /* + * The value we are returning here does not get propagated to + * the verbs caller. Thus we need to complete the request with + * error otherwise the caller could be sitting waiting on the + * completion event. Only do this for PIO. SDMA has its own + * mechanism for handling the errors. So for SDMA we can just + * return. + */ + if (pio) { + hfi1_cdbg(PIO, "%s() Failed. Completing with err", + __func__); + spin_lock_irqsave(&qp->s_lock, flags); + hfi1_send_complete(qp, qp->s_wqe, IB_WC_GENERAL_ERR); + spin_unlock_irqrestore(&qp->s_lock, flags); + } + return -EINVAL; + } + + /* + * Calculate the send buffer trigger address. 
+ * The +2 counts for the pbc control qword + */ + plen = hdrwords + dwords + 2; + + if (pio) { + ret = dd->process_pio_send( + qp, ahdr, hdrwords, ss, len, plen, dwords, 0); + } else { +#ifdef CONFIG_SDMA_VERBOSITY + dd_dev_err(dd, "CONFIG SDMA %s:%d %s()\n", + slashstrip(__FILE__), __LINE__, __func__); + dd_dev_err(dd, "SDMA hdrwords = %u, len = %u\n", hdrwords, len); +#endif + ret = dd->process_dma_send( + qp, ahdr, hdrwords, ss, len, plen, dwords, 0); + } + + return ret; +} + +static int query_device(struct ib_device *ibdev, + struct ib_device_attr *props, + struct ib_udata *uhw) +{ + struct hfi1_devdata *dd = dd_from_ibdev(ibdev); + struct hfi1_ibdev *dev = to_idev(ibdev); + + if (uhw->inlen || uhw->outlen) + return -EINVAL; + props->device_cap_flags = IB_DEVICE_BAD_PKEY_CNTR | + IB_DEVICE_BAD_QKEY_CNTR | IB_DEVICE_SHUTDOWN_PORT | + IB_DEVICE_SYS_IMAGE_GUID | IB_DEVICE_RC_RNR_NAK_GEN | + IB_DEVICE_PORT_ACTIVE_EVENT | IB_DEVICE_SRQ_RESIZE; + + props->page_size_cap = PAGE_SIZE; + props->vendor_id = + dd->oui1 << 16 | dd->oui2 << 8 | dd->oui3; + props->vendor_part_id = dd->pcidev->device; + props->hw_ver = dd->minrev; + props->sys_image_guid = ib_hfi1_sys_image_guid; + props->max_mr_size = ~0ULL; + props->max_qp = hfi1_max_qps; + props->max_qp_wr = hfi1_max_qp_wrs; + props->max_sge = hfi1_max_sges; + props->max_cq = hfi1_max_cqs; + props->max_ah = hfi1_max_ahs; + props->max_cqe = hfi1_max_cqes; + props->max_mr = dev->lk_table.max; + props->max_fmr = dev->lk_table.max; + props->max_map_per_fmr = 32767; + props->max_pd = hfi1_max_pds; + props->max_qp_rd_atom = HFI1_MAX_RDMA_ATOMIC; + props->max_qp_init_rd_atom = 255; + /* props->max_res_rd_atom */ + props->max_srq = hfi1_max_srqs; + props->max_srq_wr = hfi1_max_srq_wrs; + props->max_srq_sge = hfi1_max_srq_sges; + /* props->local_ca_ack_delay */ + props->atomic_cap = IB_ATOMIC_GLOB; + props->max_pkeys = hfi1_get_npkeys(dd); + props->max_mcast_grp = hfi1_max_mcast_grps; + props->max_mcast_qp_attach = 
hfi1_max_mcast_qp_attached; + props->max_total_mcast_qp_attach = props->max_mcast_qp_attach * + props->max_mcast_grp; + + return 0; +} + +static inline u16 opa_speed_to_ib(u16 in) +{ + u16 out = 0; + + if (in & OPA_LINK_SPEED_25G) + out |= IB_SPEED_EDR; + if (in & OPA_LINK_SPEED_12_5G) + out |= IB_SPEED_FDR; + + BUG_ON(!out); + return out; +} + +/* + * Convert a single OPA link width (no multiple flags) to an IB value. + * A zero OPA link width means link down, which means the IB width value + * is a don't care. + */ +static inline u16 opa_width_to_ib(u16 in) +{ + switch (in) { + case OPA_LINK_WIDTH_1X: + /* map 2x and 3x to 1x as they don't exist in IB */ + case OPA_LINK_WIDTH_2X: + case OPA_LINK_WIDTH_3X: + return IB_WIDTH_1X; + default: /* link down or unknown, return our largest width */ + case OPA_LINK_WIDTH_4X: + return IB_WIDTH_4X; + } +} + +static int query_port(struct ib_device *ibdev, u8 port, + struct ib_port_attr *props) +{ + struct hfi1_devdata *dd = dd_from_ibdev(ibdev); + struct hfi1_ibport *ibp = to_iport(ibdev, port); + struct hfi1_pportdata *ppd = ppd_from_ibp(ibp); + u16 lid = ppd->lid; + + memset(props, 0, sizeof(*props)); + props->lid = lid ? 
lid : 0; + props->lmc = ppd->lmc; + props->sm_lid = ibp->sm_lid; + props->sm_sl = ibp->sm_sl; + /* OPA logical states match IB logical states */ + props->state = driver_lstate(ppd); + props->phys_state = hfi1_ibphys_portstate(ppd); + props->port_cap_flags = ibp->port_cap_flags; + props->gid_tbl_len = HFI1_GUIDS_PER_PORT; + props->max_msg_sz = 0x80000000; + props->pkey_tbl_len = hfi1_get_npkeys(dd); + props->bad_pkey_cntr = ibp->pkey_violations; + props->qkey_viol_cntr = ibp->qkey_violations; + props->active_width = (u8)opa_width_to_ib(ppd->link_width_active); + /* see rate_show() in ib core/sysfs.c */ + props->active_speed = (u8)opa_speed_to_ib(ppd->link_speed_active); + props->max_vl_num = hfi1_num_vls(ppd->vls_supported); + props->init_type_reply = 0; + + /* Once we are a "first class" citizen and have added the OPA MTUs to + * the core we can advertise the larger MTU enum to the ULPs, for now + * advertise only 4K. + * + * Those applications which are either OPA aware or pass the MTU enum + * from the Path Records to us will get the new 8k MTU. Those that + * attempt to process the MTU enum may fail in various ways. + */ + props->max_mtu = mtu_to_enum((!valid_ib_mtu(hfi1_max_mtu) ? + 4096 : hfi1_max_mtu), IB_MTU_4096); + props->active_mtu = !valid_ib_mtu(ppd->ibmtu) ? 
props->max_mtu : + mtu_to_enum(ppd->ibmtu, IB_MTU_2048); + props->subnet_timeout = ibp->subnet_timeout; + + return 0; +} + +static int port_immutable(struct ib_device *ibdev, u8 port_num, + struct ib_port_immutable *immutable) +{ + struct ib_port_attr attr; + int err; + + err = query_port(ibdev, port_num, &attr); + if (err) + return err; + + memset(immutable, 0, sizeof(*immutable)); + + immutable->pkey_tbl_len = attr.pkey_tbl_len; + immutable->gid_tbl_len = attr.gid_tbl_len; + immutable->core_cap_flags = RDMA_CORE_PORT_INTEL_OPA; + immutable->max_mad_size = OPA_MGMT_MAD_SIZE; + + return 0; +} + +static int modify_device(struct ib_device *device, + int device_modify_mask, + struct ib_device_modify *device_modify) +{ + struct hfi1_devdata *dd = dd_from_ibdev(device); + unsigned i; + int ret; + + if (device_modify_mask & ~(IB_DEVICE_MODIFY_SYS_IMAGE_GUID | + IB_DEVICE_MODIFY_NODE_DESC)) { + ret = -EOPNOTSUPP; + goto bail; + } + + if (device_modify_mask & IB_DEVICE_MODIFY_NODE_DESC) { + memcpy(device->node_desc, device_modify->node_desc, 64); + for (i = 0; i < dd->num_pports; i++) { + struct hfi1_ibport *ibp = &dd->pport[i].ibport_data; + + hfi1_node_desc_chg(ibp); + } + } + + if (device_modify_mask & IB_DEVICE_MODIFY_SYS_IMAGE_GUID) { + ib_hfi1_sys_image_guid = + cpu_to_be64(device_modify->sys_image_guid); + for (i = 0; i < dd->num_pports; i++) { + struct hfi1_ibport *ibp = &dd->pport[i].ibport_data; + + hfi1_sys_guid_chg(ibp); + } + } + + ret = 0; + +bail: + return ret; +} + +static int modify_port(struct ib_device *ibdev, u8 port, + int port_modify_mask, struct ib_port_modify *props) +{ + struct hfi1_ibport *ibp = to_iport(ibdev, port); + struct hfi1_pportdata *ppd = ppd_from_ibp(ibp); + int ret = 0; + + ibp->port_cap_flags |= props->set_port_cap_mask; + ibp->port_cap_flags &= ~props->clr_port_cap_mask; + if (props->set_port_cap_mask || props->clr_port_cap_mask) + hfi1_cap_mask_chg(ibp); + if (port_modify_mask & IB_PORT_SHUTDOWN) { + set_link_down_reason(ppd, 
OPA_LINKDOWN_REASON_UNKNOWN, 0, + OPA_LINKDOWN_REASON_UNKNOWN); + ret = set_link_state(ppd, HLS_DN_DOWNDEF); + } + if (port_modify_mask & IB_PORT_RESET_QKEY_CNTR) + ibp->qkey_violations = 0; + return ret; +} + +static int query_gid(struct ib_device *ibdev, u8 port, + int index, union ib_gid *gid) +{ + struct hfi1_devdata *dd = dd_from_ibdev(ibdev); + int ret = 0; + + if (!port || port > dd->num_pports) + ret = -EINVAL; + else { + struct hfi1_ibport *ibp = to_iport(ibdev, port); + struct hfi1_pportdata *ppd = ppd_from_ibp(ibp); + + gid->global.subnet_prefix = ibp->gid_prefix; + if (index == 0) + gid->global.interface_id = ppd->guid; + else if (index < HFI1_GUIDS_PER_PORT) + gid->global.interface_id = ibp->guids[index - 1]; + else + ret = -EINVAL; + } + + return ret; +} + +static struct ib_pd *alloc_pd(struct ib_device *ibdev, + struct ib_ucontext *context, + struct ib_udata *udata) +{ + struct hfi1_ibdev *dev = to_idev(ibdev); + struct hfi1_pd *pd; + struct ib_pd *ret; + + /* + * This is actually totally arbitrary. Some correctness tests + * assume there's a maximum number of PDs that can be allocated. + * We don't actually have this limit, but we fail the test if + * we allow allocations of more than we report for this value. + */ + + pd = kmalloc(sizeof(*pd), GFP_KERNEL); + if (!pd) { + ret = ERR_PTR(-ENOMEM); + goto bail; + } + + spin_lock(&dev->n_pds_lock); + if (dev->n_pds_allocated == hfi1_max_pds) { + spin_unlock(&dev->n_pds_lock); + kfree(pd); + ret = ERR_PTR(-ENOMEM); + goto bail; + } + + dev->n_pds_allocated++; + spin_unlock(&dev->n_pds_lock); + + /* ib_alloc_pd() will initialize pd->ibpd. 
*/ + pd->user = udata != NULL; + + ret = &pd->ibpd; + +bail: + return ret; +} + +static int dealloc_pd(struct ib_pd *ibpd) +{ + struct hfi1_pd *pd = to_ipd(ibpd); + struct hfi1_ibdev *dev = to_idev(ibpd->device); + + spin_lock(&dev->n_pds_lock); + dev->n_pds_allocated--; + spin_unlock(&dev->n_pds_lock); + + kfree(pd); + + return 0; +} + +/* + * convert ah port,sl to sc + */ +u8 ah_to_sc(struct ib_device *ibdev, struct ib_ah_attr *ah) +{ + struct hfi1_ibport *ibp = to_iport(ibdev, ah->port_num); + + return ibp->sl_to_sc[ah->sl]; +} + +int hfi1_check_ah(struct ib_device *ibdev, struct ib_ah_attr *ah_attr) +{ + struct hfi1_ibport *ibp; + struct hfi1_pportdata *ppd; + struct hfi1_devdata *dd; + u8 sc5; + + /* A multicast address requires a GRH (see ch. 8.4.1). */ + if (ah_attr->dlid >= HFI1_MULTICAST_LID_BASE && + ah_attr->dlid != HFI1_PERMISSIVE_LID && + !(ah_attr->ah_flags & IB_AH_GRH)) + goto bail; + if ((ah_attr->ah_flags & IB_AH_GRH) && + ah_attr->grh.sgid_index >= HFI1_GUIDS_PER_PORT) + goto bail; + if (ah_attr->dlid == 0) + goto bail; + if (ah_attr->port_num < 1 || + ah_attr->port_num > ibdev->phys_port_cnt) + goto bail; + if (ah_attr->static_rate != IB_RATE_PORT_CURRENT && + ib_rate_to_mbps(ah_attr->static_rate) < 0) + goto bail; + if (ah_attr->sl >= OPA_MAX_SLS) + goto bail; + /* test the mapping for validity */ + ibp = to_iport(ibdev, ah_attr->port_num); + ppd = ppd_from_ibp(ibp); + sc5 = ibp->sl_to_sc[ah_attr->sl]; + dd = dd_from_ppd(ppd); + if (sc_to_vlt(dd, sc5) > num_vls) + goto bail; + return 0; +bail: + return -EINVAL; +} + +/** + * create_ah - create an address handle + * @pd: the protection domain + * @ah_attr: the attributes of the AH + * + * This may be called from interrupt context. 
+ */ +static struct ib_ah *create_ah(struct ib_pd *pd, + struct ib_ah_attr *ah_attr) +{ + struct hfi1_ah *ah; + struct ib_ah *ret; + struct hfi1_ibdev *dev = to_idev(pd->device); + unsigned long flags; + + if (hfi1_check_ah(pd->device, ah_attr)) { + ret = ERR_PTR(-EINVAL); + goto bail; + } + + ah = kmalloc(sizeof(*ah), GFP_ATOMIC); + if (!ah) { + ret = ERR_PTR(-ENOMEM); + goto bail; + } + + spin_lock_irqsave(&dev->n_ahs_lock, flags); + if (dev->n_ahs_allocated == hfi1_max_ahs) { + spin_unlock_irqrestore(&dev->n_ahs_lock, flags); + kfree(ah); + ret = ERR_PTR(-ENOMEM); + goto bail; + } + + dev->n_ahs_allocated++; + spin_unlock_irqrestore(&dev->n_ahs_lock, flags); + + /* ib_create_ah() will initialize ah->ibah. */ + ah->attr = *ah_attr; + atomic_set(&ah->refcount, 0); + + ret = &ah->ibah; + +bail: + return ret; +} + +struct ib_ah *hfi1_create_qp0_ah(struct hfi1_ibport *ibp, u16 dlid) +{ + struct ib_ah_attr attr; + struct ib_ah *ah = ERR_PTR(-EINVAL); + struct hfi1_qp *qp0; + + memset(&attr, 0, sizeof(attr)); + attr.dlid = dlid; + attr.port_num = ppd_from_ibp(ibp)->port; + rcu_read_lock(); + qp0 = rcu_dereference(ibp->qp0); + if (qp0) + ah = ib_create_ah(qp0->ibqp.pd, &attr); + rcu_read_unlock(); + return ah; +} + +/** + * destroy_ah - destroy an address handle + * @ibah: the AH to destroy + * + * This may be called from interrupt context. 
+ */ +static int destroy_ah(struct ib_ah *ibah) +{ + struct hfi1_ibdev *dev = to_idev(ibah->device); + struct hfi1_ah *ah = to_iah(ibah); + unsigned long flags; + + if (atomic_read(&ah->refcount) != 0) + return -EBUSY; + + spin_lock_irqsave(&dev->n_ahs_lock, flags); + dev->n_ahs_allocated--; + spin_unlock_irqrestore(&dev->n_ahs_lock, flags); + + kfree(ah); + + return 0; +} + +static int modify_ah(struct ib_ah *ibah, struct ib_ah_attr *ah_attr) +{ + struct hfi1_ah *ah = to_iah(ibah); + + if (hfi1_check_ah(ibah->device, ah_attr)) + return -EINVAL; + + ah->attr = *ah_attr; + + return 0; +} + +static int query_ah(struct ib_ah *ibah, struct ib_ah_attr *ah_attr) +{ + struct hfi1_ah *ah = to_iah(ibah); + + *ah_attr = ah->attr; + + return 0; +} + +/** + * hfi1_get_npkeys - return the size of the PKEY table for context 0 + * @dd: the hfi1_ib device + */ +unsigned hfi1_get_npkeys(struct hfi1_devdata *dd) +{ + return ARRAY_SIZE(dd->pport[0].pkeys); +} + +/* + * Return the indexed PKEY from the port PKEY table. 
+ */ +unsigned hfi1_get_pkey(struct hfi1_ibport *ibp, unsigned index) +{ + struct hfi1_pportdata *ppd = ppd_from_ibp(ibp); + unsigned ret; + + if (index >= ARRAY_SIZE(ppd->pkeys)) + ret = 0; + else + ret = ppd->pkeys[index]; + + return ret; +} + +static int query_pkey(struct ib_device *ibdev, u8 port, u16 index, + u16 *pkey) +{ + struct hfi1_devdata *dd = dd_from_ibdev(ibdev); + int ret; + + if (index >= hfi1_get_npkeys(dd)) { + ret = -EINVAL; + goto bail; + } + + *pkey = hfi1_get_pkey(to_iport(ibdev, port), index); + ret = 0; + +bail: + return ret; +} + +/** + * alloc_ucontext - allocate a ucontext + * @ibdev: the infiniband device + * @udata: not used by the driver + */ + +static struct ib_ucontext *alloc_ucontext(struct ib_device *ibdev, + struct ib_udata *udata) +{ + struct hfi1_ucontext *context; + struct ib_ucontext *ret; + + context = kmalloc(sizeof(*context), GFP_KERNEL); + if (!context) { + ret = ERR_PTR(-ENOMEM); + goto bail; + } + + ret = &context->ibucontext; + +bail: + return ret; +} + +static int dealloc_ucontext(struct ib_ucontext *context) +{ + kfree(to_iucontext(context)); + return 0; +} + +static void init_ibport(struct hfi1_pportdata *ppd) +{ + struct hfi1_ibport *ibp = &ppd->ibport_data; + size_t sz = ARRAY_SIZE(ibp->sl_to_sc); + int i; + + for (i = 0; i < sz; i++) { + ibp->sl_to_sc[i] = i; + ibp->sc_to_sl[i] = i; + } + + spin_lock_init(&ibp->lock); + /* Set the prefix to the default value (see ch. 
4.1.1) */ + ibp->gid_prefix = IB_DEFAULT_GID_PREFIX; + ibp->sm_lid = 0; + /* Below should only set bits defined in OPA PortInfo.CapabilityMask */ + ibp->port_cap_flags = IB_PORT_AUTO_MIGR_SUP | + IB_PORT_CAP_MASK_NOTICE_SUP; + ibp->pma_counter_select[0] = IB_PMA_PORT_XMIT_DATA; + ibp->pma_counter_select[1] = IB_PMA_PORT_RCV_DATA; + ibp->pma_counter_select[2] = IB_PMA_PORT_XMIT_PKTS; + ibp->pma_counter_select[3] = IB_PMA_PORT_RCV_PKTS; + ibp->pma_counter_select[4] = IB_PMA_PORT_XMIT_WAIT; + + RCU_INIT_POINTER(ibp->qp0, NULL); + RCU_INIT_POINTER(ibp->qp1, NULL); +} + +/** + * hfi1_register_ib_device - register our device with the infiniband core + * @dd: the device data structure + * Return 0 if successful, errno if unsuccessful. + */ +int hfi1_register_ib_device(struct hfi1_devdata *dd) +{ + struct hfi1_ibdev *dev = &dd->verbs_dev; + struct ib_device *ibdev = &dev->ibdev; + struct hfi1_pportdata *ppd = dd->pport; + unsigned i, lk_tab_size; + int ret; + size_t lcpysz = IB_DEVICE_NAME_MAX; + u16 descq_cnt; + + ret = hfi1_qp_init(dev); + if (ret) + goto err_qp_init; + + + for (i = 0; i < dd->num_pports; i++) + init_ibport(ppd + i); + + /* Only need to initialize non-zero fields. */ + spin_lock_init(&dev->n_pds_lock); + spin_lock_init(&dev->n_ahs_lock); + spin_lock_init(&dev->n_cqs_lock); + spin_lock_init(&dev->n_qps_lock); + spin_lock_init(&dev->n_srqs_lock); + spin_lock_init(&dev->n_mcast_grps_lock); + init_timer(&dev->mem_timer); + dev->mem_timer.function = mem_timer; + dev->mem_timer.data = (unsigned long) dev; + + /* + * The top hfi1_lkey_table_size bits are used to index the + * table. The lower 8 bits can be owned by the user (copied from + * the LKEY). The remaining bits act as a generation number or tag. 
+ */ + spin_lock_init(&dev->lk_table.lock); + dev->lk_table.max = 1 << hfi1_lkey_table_size; + lk_tab_size = dev->lk_table.max * sizeof(*dev->lk_table.table); + dev->lk_table.table = (struct hfi1_mregion __rcu **) + __get_free_pages(GFP_KERNEL, get_order(lk_tab_size)); + if (dev->lk_table.table == NULL) { + ret = -ENOMEM; + goto err_lk; + } + RCU_INIT_POINTER(dev->dma_mr, NULL); + for (i = 0; i < dev->lk_table.max; i++) + RCU_INIT_POINTER(dev->lk_table.table[i], NULL); + INIT_LIST_HEAD(&dev->pending_mmaps); + spin_lock_init(&dev->pending_lock); + dev->mmap_offset = PAGE_SIZE; + spin_lock_init(&dev->mmap_offset_lock); + INIT_LIST_HEAD(&dev->txwait); + INIT_LIST_HEAD(&dev->memwait); + INIT_LIST_HEAD(&dev->txreq_free); + + descq_cnt = sdma_get_descq_cnt(); + /* + * AHG mode copy requires header be on cache line + */ + dev->pio_hdr_bytes = descq_cnt * sizeof(struct tx_pio_header); + if (descq_cnt) { + dev->pio_hdrs = dma_zalloc_coherent(&dd->pcidev->dev, + dev->pio_hdr_bytes, + &dev->pio_hdrs_phys, + GFP_KERNEL); + if (!dev->pio_hdrs) { + ret = -ENOMEM; + goto err_hdrs; + } + } + + for (i = 0; i < descq_cnt; i++) { + struct verbs_txreq *tx; + + tx = kzalloc(sizeof(*tx), GFP_KERNEL); + if (!tx) { + ret = -ENOMEM; + goto err_tx; + } + tx->hdr_inx = i; + list_add(&tx->txreq.list, &dev->txreq_free); + } + + /* + * The system image GUID is supposed to be the same for all + * IB HCAs in a single system but since there can be other + * device types in the system, we can't be sure this is unique. 
+ */ + if (!ib_hfi1_sys_image_guid) + ib_hfi1_sys_image_guid = ppd->guid; + lcpysz = strlcpy(ibdev->name, class_name(), lcpysz); + strlcpy(ibdev->name + lcpysz, "_%d", IB_DEVICE_NAME_MAX - lcpysz); + ibdev->owner = THIS_MODULE; + ibdev->node_guid = ppd->guid; + ibdev->uverbs_abi_ver = HFI1_UVERBS_ABI_VERSION; + ibdev->uverbs_cmd_mask = + (1ull << IB_USER_VERBS_CMD_GET_CONTEXT) | + (1ull << IB_USER_VERBS_CMD_QUERY_DEVICE) | + (1ull << IB_USER_VERBS_CMD_QUERY_PORT) | + (1ull << IB_USER_VERBS_CMD_ALLOC_PD) | + (1ull << IB_USER_VERBS_CMD_DEALLOC_PD) | + (1ull << IB_USER_VERBS_CMD_CREATE_AH) | + (1ull << IB_USER_VERBS_CMD_MODIFY_AH) | + (1ull << IB_USER_VERBS_CMD_QUERY_AH) | + (1ull << IB_USER_VERBS_CMD_DESTROY_AH) | + (1ull << IB_USER_VERBS_CMD_REG_MR) | + (1ull << IB_USER_VERBS_CMD_DEREG_MR) | + (1ull << IB_USER_VERBS_CMD_CREATE_COMP_CHANNEL) | + (1ull << IB_USER_VERBS_CMD_CREATE_CQ) | + (1ull << IB_USER_VERBS_CMD_RESIZE_CQ) | + (1ull << IB_USER_VERBS_CMD_DESTROY_CQ) | + (1ull << IB_USER_VERBS_CMD_POLL_CQ) | + (1ull << IB_USER_VERBS_CMD_REQ_NOTIFY_CQ) | + (1ull << IB_USER_VERBS_CMD_CREATE_QP) | + (1ull << IB_USER_VERBS_CMD_QUERY_QP) | + (1ull << IB_USER_VERBS_CMD_MODIFY_QP) | + (1ull << IB_USER_VERBS_CMD_DESTROY_QP) | + (1ull << IB_USER_VERBS_CMD_POST_SEND) | + (1ull << IB_USER_VERBS_CMD_POST_RECV) | + (1ull << IB_USER_VERBS_CMD_ATTACH_MCAST) | + (1ull << IB_USER_VERBS_CMD_DETACH_MCAST) | + (1ull << IB_USER_VERBS_CMD_CREATE_SRQ) | + (1ull << IB_USER_VERBS_CMD_MODIFY_SRQ) | + (1ull << IB_USER_VERBS_CMD_QUERY_SRQ) | + (1ull << IB_USER_VERBS_CMD_DESTROY_SRQ) | + (1ull << IB_USER_VERBS_CMD_POST_SRQ_RECV); + ibdev->node_type = RDMA_NODE_IB_CA; + ibdev->phys_port_cnt = dd->num_pports; + ibdev->num_comp_vectors = 1; + ibdev->dma_device = &dd->pcidev->dev; + ibdev->query_device = query_device; + ibdev->modify_device = modify_device; + ibdev->query_port = query_port; + ibdev->modify_port = modify_port; + ibdev->query_pkey = query_pkey; + ibdev->query_gid = query_gid; + 
ibdev->alloc_ucontext = alloc_ucontext; + ibdev->dealloc_ucontext = dealloc_ucontext; + ibdev->alloc_pd = alloc_pd; + ibdev->dealloc_pd = dealloc_pd; + ibdev->create_ah = create_ah; + ibdev->destroy_ah = destroy_ah; + ibdev->modify_ah = modify_ah; + ibdev->query_ah = query_ah; + ibdev->create_srq = hfi1_create_srq; + ibdev->modify_srq = hfi1_modify_srq; + ibdev->query_srq = hfi1_query_srq; + ibdev->destroy_srq = hfi1_destroy_srq; + ibdev->create_qp = hfi1_create_qp; + ibdev->modify_qp = hfi1_modify_qp; + ibdev->query_qp = hfi1_query_qp; + ibdev->destroy_qp = hfi1_destroy_qp; + ibdev->post_send = post_send; + ibdev->post_recv = post_receive; + ibdev->post_srq_recv = hfi1_post_srq_receive; + ibdev->create_cq = hfi1_create_cq; + ibdev->destroy_cq = hfi1_destroy_cq; + ibdev->resize_cq = hfi1_resize_cq; + ibdev->poll_cq = hfi1_poll_cq; + ibdev->req_notify_cq = hfi1_req_notify_cq; + ibdev->get_dma_mr = hfi1_get_dma_mr; + ibdev->reg_phys_mr = hfi1_reg_phys_mr; + ibdev->reg_user_mr = hfi1_reg_user_mr; + ibdev->dereg_mr = hfi1_dereg_mr; + ibdev->alloc_fast_reg_mr = hfi1_alloc_fast_reg_mr; + ibdev->alloc_fast_reg_page_list = hfi1_alloc_fast_reg_page_list; + ibdev->free_fast_reg_page_list = hfi1_free_fast_reg_page_list; + ibdev->alloc_fmr = hfi1_alloc_fmr; + ibdev->map_phys_fmr = hfi1_map_phys_fmr; + ibdev->unmap_fmr = hfi1_unmap_fmr; + ibdev->dealloc_fmr = hfi1_dealloc_fmr; + ibdev->attach_mcast = hfi1_multicast_attach; + ibdev->detach_mcast = hfi1_multicast_detach; + ibdev->process_mad = hfi1_process_mad; + ibdev->mmap = hfi1_mmap; + ibdev->dma_ops = &hfi1_dma_mapping_ops; + ibdev->get_port_immutable = port_immutable; + + strncpy(ibdev->node_desc, init_utsname()->nodename, + sizeof(ibdev->node_desc)); + + ret = ib_register_device(ibdev, hfi1_create_port_files); + if (ret) + goto err_reg; + + ret = hfi1_create_agents(dev); + if (ret) + goto err_agents; + + ret = hfi1_verbs_register_sysfs(dd); + if (ret) + goto err_class; + + goto bail; + +err_class: + hfi1_free_agents(dev); 
+err_agents: + ib_unregister_device(ibdev); +err_reg: +err_tx: + while (!list_empty(&dev->txreq_free)) { + struct list_head *l = dev->txreq_free.next; + struct verbs_txreq *tx; + + list_del(l); + tx = list_entry(l, struct verbs_txreq, txreq.list); + kfree(tx); + } + if (dev->pio_hdrs) + dma_free_coherent(&dd->pcidev->dev, + dev->pio_hdr_bytes, + dev->pio_hdrs, dev->pio_hdrs_phys); +err_hdrs: + free_pages((unsigned long) dev->lk_table.table, get_order(lk_tab_size)); +err_lk: + hfi1_qp_exit(dev); +err_qp_init: + dd_dev_err(dd, "cannot register verbs: %d!\n", -ret); +bail: + return ret; +} + +void hfi1_unregister_ib_device(struct hfi1_devdata *dd) +{ + struct hfi1_ibdev *dev = &dd->verbs_dev; + struct ib_device *ibdev = &dev->ibdev; + unsigned lk_tab_size; + + hfi1_verbs_unregister_sysfs(dd); + + hfi1_free_agents(dev); + + ib_unregister_device(ibdev); + + if (!list_empty(&dev->txwait)) + dd_dev_err(dd, "txwait list not empty!\n"); + if (!list_empty(&dev->memwait)) + dd_dev_err(dd, "memwait list not empty!\n"); + if (dev->dma_mr) + dd_dev_err(dd, "DMA MR not NULL!\n"); + + hfi1_qp_exit(dev); + del_timer_sync(&dev->mem_timer); + while (!list_empty(&dev->txreq_free)) { + struct list_head *l = dev->txreq_free.next; + struct verbs_txreq *tx; + + list_del(l); + tx = list_entry(l, struct verbs_txreq, txreq.list); + kfree(tx); + } + if (dev->pio_hdrs) + dma_free_coherent(&dd->pcidev->dev, + dev->pio_hdr_bytes, + dev->pio_hdrs, dev->pio_hdrs_phys); + lk_tab_size = dev->lk_table.max * sizeof(*dev->lk_table.table); + free_pages((unsigned long) dev->lk_table.table, + get_order(lk_tab_size)); +} + +/* + * This must be called with s_lock held. 
+ */ +void hfi1_schedule_send(struct hfi1_qp *qp) +{ + if (hfi1_send_ok(qp)) { + struct hfi1_ibport *ibp = + to_iport(qp->ibqp.device, qp->port_num); + struct hfi1_pportdata *ppd = ppd_from_ibp(ibp); + + iowait_schedule(&qp->s_iowait, ppd->hfi1_wq); + } +} diff --git a/drivers/infiniband/hw/hfi1/verbs.h b/drivers/infiniband/hw/hfi1/verbs.h new file mode 100644 index 0000000..c557e8e --- /dev/null +++ b/drivers/infiniband/hw/hfi1/verbs.h @@ -0,0 +1,1193 @@ +/* + * + * This file is provided under a dual BSD/GPLv2 license. When using or + * redistributing this file, you may do so under either license. + * + * GPL LICENSE SUMMARY + * + * Copyright(c) 2015 Intel Corporation. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of version 2 of the GNU General Public License as + * published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, but + * WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * General Public License for more details. + * + * BSD LICENSE + * + * Copyright(c) 2015 Intel Corporation. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * + * - Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * - Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in + * the documentation and/or other materials provided with the + * distribution. + * - Neither the name of Intel Corporation nor the names of its + * contributors may be used to endorse or promote products derived + * from this software without specific prior written permission. 
+ * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS + * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT + * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR + * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT + * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, + * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT + * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, + * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY + * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + */ + +#ifndef HFI1_VERBS_H +#define HFI1_VERBS_H + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +struct hfi1_ctxtdata; +struct hfi1_pportdata; +struct hfi1_devdata; +struct hfi1_packet; + +#include "iowait.h" + +#define HFI1_MAX_RDMA_ATOMIC 16 +#define HFI1_GUIDS_PER_PORT 5 + +/* + * Increment this value if any changes that break userspace ABI + * compatibility are made. + */ +#define HFI1_UVERBS_ABI_VERSION 2 + +/* + * Define an ib_cq_notify value that is not valid so we know when CQ + * notifications are armed. 
+ */ +#define IB_CQ_NONE (IB_CQ_NEXT_COMP + 1) + +#define IB_SEQ_NAK (3 << 29) + +/* AETH NAK opcode values */ +#define IB_RNR_NAK 0x20 +#define IB_NAK_PSN_ERROR 0x60 +#define IB_NAK_INVALID_REQUEST 0x61 +#define IB_NAK_REMOTE_ACCESS_ERROR 0x62 +#define IB_NAK_REMOTE_OPERATIONAL_ERROR 0x63 +#define IB_NAK_INVALID_RD_REQUEST 0x64 + +/* Flags for checking QP state (see ib_hfi1_state_ops[]) */ +#define HFI1_POST_SEND_OK 0x01 +#define HFI1_POST_RECV_OK 0x02 +#define HFI1_PROCESS_RECV_OK 0x04 +#define HFI1_PROCESS_SEND_OK 0x08 +#define HFI1_PROCESS_NEXT_SEND_OK 0x10 +#define HFI1_FLUSH_SEND 0x20 +#define HFI1_FLUSH_RECV 0x40 +#define HFI1_PROCESS_OR_FLUSH_SEND \ + (HFI1_PROCESS_SEND_OK | HFI1_FLUSH_SEND) + +/* IB Performance Manager status values */ +#define IB_PMA_SAMPLE_STATUS_DONE 0x00 +#define IB_PMA_SAMPLE_STATUS_STARTED 0x01 +#define IB_PMA_SAMPLE_STATUS_RUNNING 0x02 + +/* Mandatory IB performance counter select values. */ +#define IB_PMA_PORT_XMIT_DATA cpu_to_be16(0x0001) +#define IB_PMA_PORT_RCV_DATA cpu_to_be16(0x0002) +#define IB_PMA_PORT_XMIT_PKTS cpu_to_be16(0x0003) +#define IB_PMA_PORT_RCV_PKTS cpu_to_be16(0x0004) +#define IB_PMA_PORT_XMIT_WAIT cpu_to_be16(0x0005) + +#define HFI1_VENDOR_IPG cpu_to_be16(0xFFA0) + +#define IB_BTH_REQ_ACK (1 << 31) +#define IB_BTH_SOLICITED (1 << 23) +#define IB_BTH_MIG_REQ (1 << 22) + +#define IB_GRH_VERSION 6 +#define IB_GRH_VERSION_MASK 0xF +#define IB_GRH_VERSION_SHIFT 28 +#define IB_GRH_TCLASS_MASK 0xFF +#define IB_GRH_TCLASS_SHIFT 20 +#define IB_GRH_FLOW_MASK 0xFFFFF +#define IB_GRH_FLOW_SHIFT 0 +#define IB_GRH_NEXT_HDR 0x1B + +#define IB_DEFAULT_GID_PREFIX cpu_to_be64(0xfe80000000000000ULL) + +/* Values for set/get portinfo VLCap OperationalVLs */ +#define IB_VL_VL0 1 +#define IB_VL_VL0_1 2 +#define IB_VL_VL0_3 3 +#define IB_VL_VL0_7 4 +#define IB_VL_VL0_14 5 + +/* flags passed by hfi1_ib_rcv() */ +enum { + HFI1_HAS_GRH = (1 << 0), + HFI1_SC4_BIT = (1 << 1), /* indicates the DC set the SC[4] bit */ +}; + +static inline 
int hfi1_num_vls(int vls) +{ + switch (vls) { + default: + case IB_VL_VL0: + return 1; + case IB_VL_VL0_1: + return 2; + case IB_VL_VL0_3: + return 4; + case IB_VL_VL0_7: + return 8; + case IB_VL_VL0_14: + return 15; + } +} + +static inline int hfi1_vls_to_ib_enum(u8 num_vls) +{ + switch (num_vls) { + case 1: + return IB_VL_VL0; + case 2: + return IB_VL_VL0_1; + case 4: + return IB_VL_VL0_3; + case 8: + return IB_VL_VL0_7; + case 15: + return IB_VL_VL0_14; + default: + return -1; + } +} + +struct ib_reth { + __be64 vaddr; + __be32 rkey; + __be32 length; +} __packed; + +struct ib_atomic_eth { + __be32 vaddr[2]; /* unaligned so access as 2 32-bit words */ + __be32 rkey; + __be64 swap_data; + __be64 compare_data; +} __packed; + +union ib_ehdrs { + struct { + __be32 deth[2]; + __be32 imm_data; + } ud; + struct { + struct ib_reth reth; + __be32 imm_data; + } rc; + struct { + __be32 aeth; + __be32 atomic_ack_eth[2]; + } at; + __be32 imm_data; + __be32 aeth; + struct ib_atomic_eth atomic_eth; +} __packed; + +struct hfi1_other_headers { + __be32 bth[3]; + union ib_ehdrs u; +} __packed; + +/* + * Note that UD packets with a GRH header are 8+40+12+8 = 68 bytes + * long (72 w/ imm_data). Only the first 56 bytes of the IB header + * will be in the eager header buffer. The remaining 12 or 16 bytes + * are in the data buffer. + */ +struct hfi1_ib_header { + __be16 lrh[4]; + union { + struct { + struct ib_grh grh; + struct hfi1_other_headers oth; + } l; + struct hfi1_other_headers oth; + } u; +} __packed; + +struct ahg_ib_header { + struct sdma_engine *sde; + u32 ahgdesc[2]; + u16 tx_flags; + u8 ahgcount; + u8 ahgidx; + struct hfi1_ib_header ibh; +}; + +struct hfi1_pio_header { + __le64 pbc; + struct hfi1_ib_header hdr; +} __packed; + +/* + * used to force cacheline alignment for AHG + */ +struct tx_pio_header { + struct hfi1_pio_header phdr; +} ____cacheline_aligned; + +/* + * There is one struct hfi1_mcast for each multicast GID.
+ * All attached QPs are then stored as a list of + * struct hfi1_mcast_qp. + */ +struct hfi1_mcast_qp { + struct list_head list; + struct hfi1_qp *qp; +}; + +struct hfi1_mcast { + struct rb_node rb_node; + union ib_gid mgid; + struct list_head qp_list; + wait_queue_head_t wait; + atomic_t refcount; + int n_attached; +}; + +/* Protection domain */ +struct hfi1_pd { + struct ib_pd ibpd; + int user; /* non-zero if created from user space */ +}; + +/* Address Handle */ +struct hfi1_ah { + struct ib_ah ibah; + struct ib_ah_attr attr; + atomic_t refcount; +}; + +/* + * This structure is used by hfi1_mmap() to validate an offset + * when an mmap() request is made. The vm_area_struct then uses + * this as its vm_private_data. + */ +struct hfi1_mmap_info { + struct list_head pending_mmaps; + struct ib_ucontext *context; + void *obj; + __u64 offset; + struct kref ref; + unsigned size; +}; + +/* + * This structure is used to contain the head pointer, tail pointer, + * and completion queue entries as a single memory allocation so + * it can be mmap'ed into user space. + */ +struct hfi1_cq_wc { + u32 head; /* index of next entry to fill */ + u32 tail; /* index of next ib_poll_cq() entry */ + union { + /* these are actually size ibcq.cqe + 1 */ + struct ib_uverbs_wc uqueue[0]; + struct ib_wc kqueue[0]; + }; +}; + +/* + * The completion queue structure. + */ +struct hfi1_cq { + struct ib_cq ibcq; + struct kthread_work comptask; + struct hfi1_devdata *dd; + spinlock_t lock; /* protect changes in this struct */ + u8 notify; + u8 triggered; + struct hfi1_cq_wc *queue; + struct hfi1_mmap_info *ip; +}; + +/* + * A segment is a linear region of low physical memory. + * Used by the verbs layer. + */ +struct hfi1_seg { + void *vaddr; + size_t length; +}; + +/* The number of hfi1_segs that fit in a page. 
*/ +#define HFI1_SEGSZ (PAGE_SIZE / sizeof(struct hfi1_seg)) + +struct hfi1_segarray { + struct hfi1_seg segs[HFI1_SEGSZ]; +}; + +struct hfi1_mregion { + struct ib_pd *pd; /* shares refcnt of ibmr.pd */ + u64 user_base; /* User's address for this region */ + u64 iova; /* IB start address of this region */ + size_t length; + u32 lkey; + u32 offset; /* offset (bytes) to start of region */ + int access_flags; + u32 max_segs; /* number of hfi1_segs in all the arrays */ + u32 mapsz; /* size of the map array */ + u8 page_shift; /* 0 - non-uniform/non-power-of-2 sizes */ + u8 lkey_published; /* in global table */ + struct completion comp; /* complete when refcount goes to zero */ + atomic_t refcount; + struct hfi1_segarray *map[0]; /* the segments */ +}; + +/* + * These keep track of the copy progress within a memory region. + * Used by the verbs layer. + */ +struct hfi1_sge { + struct hfi1_mregion *mr; + void *vaddr; /* kernel virtual address of segment */ + u32 sge_length; /* length of the SGE */ + u32 length; /* remaining length of the segment */ + u16 m; /* current index: mr->map[m] */ + u16 n; /* current index: mr->map[m]->segs[n] */ +}; + +/* Memory region */ +struct hfi1_mr { + struct ib_mr ibmr; + struct ib_umem *umem; + struct hfi1_mregion mr; /* must be last */ +}; + +/* + * Send work request queue entry. + * The size of the sg_list is determined when the QP is created and stored + * in qp->s_max_sge. + */ +struct hfi1_swqe { + struct ib_send_wr wr; /* don't use wr.sg_list */ + u32 psn; /* first packet sequence number */ + u32 lpsn; /* last packet sequence number */ + u32 ssn; /* send sequence number */ + u32 length; /* total length of data in sg_list */ + struct hfi1_sge sg_list[0]; +}; + +/* + * Receive work request queue entry. + * The size of the sg_list is determined when the QP (or SRQ) is created + * and stored in qp->r_rq.max_sge (or srq->rq.max_sge).
+ */ +struct hfi1_rwqe { + u64 wr_id; + u8 num_sge; + struct ib_sge sg_list[0]; +}; + +/* + * This structure is used to contain the head pointer, tail pointer, + * and receive work queue entries as a single memory allocation so + * it can be mmap'ed into user space. + * Note that the wq array elements are variable size so you can't + * just index into the array to get the N'th element; + * use get_rwqe_ptr() instead. + */ +struct hfi1_rwq { + u32 head; /* new work requests posted to the head */ + u32 tail; /* receives pull requests from here. */ + struct hfi1_rwqe wq[0]; +}; + +struct hfi1_rq { + struct hfi1_rwq *wq; + u32 size; /* size of RWQE array */ + u8 max_sge; + /* protect changes in this struct */ + spinlock_t lock ____cacheline_aligned_in_smp; +}; + +struct hfi1_srq { + struct ib_srq ibsrq; + struct hfi1_rq rq; + struct hfi1_mmap_info *ip; + /* send signal when number of RWQEs < limit */ + u32 limit; +}; + +struct hfi1_sge_state { + struct hfi1_sge *sg_list; /* next SGE to be used if any */ + struct hfi1_sge sge; /* progress state for the current SGE */ + u32 total_len; + u8 num_sge; +}; + +/* + * This structure holds the information that the send tasklet needs + * to send a RDMA read response or atomic operation. + */ +struct hfi1_ack_entry { + u8 opcode; + u8 sent; + u32 psn; + u32 lpsn; + union { + struct hfi1_sge rdma_sge; + u64 atomic_data; + }; +}; + +/* + * Variables prefixed with s_ are for the requester (sender). + * Variables prefixed with r_ are for the responder (receiver). + * Variables prefixed with ack_ are for responder replies. + * + * Common variables are protected by both r_rq.lock and s_lock in that order + * which only happens in modify_qp() or changing the QP 'state'. 
+ */ +struct hfi1_qp { + struct ib_qp ibqp; + /* read mostly fields above and below */ + struct ib_ah_attr remote_ah_attr; + struct ib_ah_attr alt_ah_attr; + struct hfi1_qp __rcu *next; /* link list for QPN hash table */ + struct hfi1_swqe *s_wq; /* send work queue */ + struct hfi1_mmap_info *ip; + struct ahg_ib_header *s_hdr; /* next packet header to send */ + u8 s_sc; /* SC[0..4] for next packet */ + unsigned long timeout_jiffies; /* computed from timeout */ + + enum ib_mtu path_mtu; + int srate_mbps; /* s_srate (below) converted to Mbit/s */ + u32 remote_qpn; + u32 pmtu; /* decoded from path_mtu */ + u32 qkey; /* QKEY for this QP (for UD or RD) */ + u32 s_size; /* send work queue size */ + u32 s_rnr_timeout; /* number of milliseconds for RNR timeout */ + u32 s_ahgpsn; /* set to the psn in the copy of the header */ + + u8 state; /* QP state */ + u8 qp_access_flags; + u8 alt_timeout; /* Alternate path timeout for this QP */ + u8 timeout; /* Timeout for this QP */ + u8 s_srate; + u8 s_mig_state; + u8 port_num; + u8 s_pkey_index; /* PKEY index to use */ + u8 s_alt_pkey_index; /* Alternate path PKEY index to use */ + u8 r_max_rd_atomic; /* max number of RDMA read/atomic to receive */ + u8 s_max_rd_atomic; /* max number of RDMA read/atomic to send */ + u8 s_retry_cnt; /* number of times to retry */ + u8 s_rnr_retry_cnt; + u8 r_min_rnr_timer; /* retry timeout value for RNR NAKs */ + u8 s_max_sge; /* size of s_wq->sg_list */ + u8 s_draining; + + /* start of read/write fields */ + + atomic_t refcount ____cacheline_aligned_in_smp; + wait_queue_head_t wait; + + + struct hfi1_ack_entry s_ack_queue[HFI1_MAX_RDMA_ATOMIC + 1] + ____cacheline_aligned_in_smp; + struct hfi1_sge_state s_rdma_read_sge; + + spinlock_t r_lock ____cacheline_aligned_in_smp; /* used for APM */ + unsigned long r_aflags; + u64 r_wr_id; /* ID for current receive WQE */ + u32 r_ack_psn; /* PSN for next ACK or atomic ACK */ + u32 r_len; /* total length of r_sge */ + u32 r_rcv_len; /* receive data len 
processed */ + u32 r_psn; /* expected rcv packet sequence number */ + u32 r_msn; /* message sequence number */ + + u8 r_state; /* opcode of last packet received */ + u8 r_flags; + u8 r_head_ack_queue; /* index into s_ack_queue[] */ + + struct list_head rspwait; /* link for waiting to respond */ + + struct hfi1_sge_state r_sge; /* current receive data */ + struct hfi1_rq r_rq; /* receive work queue */ + + spinlock_t s_lock ____cacheline_aligned_in_smp; + unsigned long s_aflags; + struct hfi1_sge_state *s_cur_sge; + u32 s_flags; + struct hfi1_swqe *s_wqe; + struct hfi1_sge_state s_sge; /* current send request data */ + struct hfi1_mregion *s_rdma_mr; + struct sdma_engine *s_sde; /* current sde */ + u32 s_cur_size; /* size of send packet in bytes */ + u32 s_len; /* total length of s_sge */ + u32 s_rdma_read_len; /* total length of s_rdma_read_sge */ + u32 s_next_psn; /* PSN for next request */ + u32 s_last_psn; /* last response PSN processed */ + u32 s_sending_psn; /* lowest PSN that is being sent */ + u32 s_sending_hpsn; /* highest PSN that is being sent */ + u32 s_psn; /* current packet sequence number */ + u32 s_ack_rdma_psn; /* PSN for sending RDMA read responses */ + u32 s_ack_psn; /* PSN for acking sends and RDMA writes */ + u32 s_head; /* new entries added here */ + u32 s_tail; /* next entry to process */ + u32 s_cur; /* current work queue entry */ + u32 s_acked; /* last un-ACK'ed entry */ + u32 s_last; /* last completed entry */ + u32 s_ssn; /* SSN of tail entry */ + u32 s_lsn; /* limit sequence number (credit) */ + u16 s_hdrwords; /* size of s_hdr in 32 bit words */ + u16 s_rdma_ack_cnt; + s8 s_ahgidx; + u8 s_state; /* opcode of last packet sent */ + u8 s_ack_state; /* opcode of packet to ACK */ + u8 s_nak_state; /* non-zero if NAK is pending */ + u8 r_nak_state; /* non-zero if NAK is pending */ + u8 s_retry; /* requester retry counter */ + u8 s_rnr_retry; /* requester RNR retry counter */ + u8 s_num_rd_atomic; /* number of RDMA read/atomic pending */ + u8 
s_tail_ack_queue; /* index into s_ack_queue[] */ + + struct hfi1_sge_state s_ack_rdma_sge; + struct timer_list s_timer; + + struct iowait s_iowait; + + struct hfi1_sge r_sg_list[0] /* verified SGEs */ + ____cacheline_aligned_in_smp; +}; + +/* + * Atomic bit definitions for r_aflags. + */ +#define HFI1_R_WRID_VALID 0 +#define HFI1_R_REWIND_SGE 1 + +/* + * Atomic bit definitions for s_aflags. + */ +#define HFI1_S_ECN 0 + +/* + * Bit definitions for r_flags. + */ +#define HFI1_R_REUSE_SGE 0x01 +#define HFI1_R_RDMAR_SEQ 0x02 +#define HFI1_R_RSP_NAK 0x04 +#define HFI1_R_RSP_SEND 0x08 +#define HFI1_R_COMM_EST 0x10 + +/* + * Bit definitions for s_flags. + * + * HFI1_S_SIGNAL_REQ_WR - set if QP send WRs contain completion signaled + * HFI1_S_BUSY - send tasklet is processing the QP + * HFI1_S_TIMER - the RC retry timer is active + * HFI1_S_ACK_PENDING - an ACK is waiting to be sent after RDMA read/atomics + * HFI1_S_WAIT_FENCE - waiting for all prior RDMA read or atomic SWQEs + * before processing the next SWQE + * HFI1_S_WAIT_RDMAR - waiting for a RDMA read or atomic SWQE to complete + * before processing the next SWQE + * HFI1_S_WAIT_RNR - waiting for RNR timeout + * HFI1_S_WAIT_SSN_CREDIT - waiting for RC credits to process next SWQE + * HFI1_S_WAIT_DMA - waiting for send DMA queue to drain before generating + * next send completion entry not via send DMA + * HFI1_S_WAIT_PIO - waiting for a send buffer to be available + * HFI1_S_WAIT_TX - waiting for a struct verbs_txreq to be available + * HFI1_S_WAIT_DMA_DESC - waiting for DMA descriptors to be available + * HFI1_S_WAIT_KMEM - waiting for kernel memory to be available + * HFI1_S_WAIT_PSN - waiting for a packet to exit the send DMA queue + * HFI1_S_WAIT_ACK - waiting for an ACK packet before sending more requests + * HFI1_S_SEND_ONE - send one packet, request ACK, then wait for ACK + */ +#define HFI1_S_SIGNAL_REQ_WR 0x0001 +#define HFI1_S_BUSY 0x0002 +#define HFI1_S_TIMER 0x0004 +#define HFI1_S_RESP_PENDING 0x0008 
+#define HFI1_S_ACK_PENDING 0x0010 +#define HFI1_S_WAIT_FENCE 0x0020 +#define HFI1_S_WAIT_RDMAR 0x0040 +#define HFI1_S_WAIT_RNR 0x0080 +#define HFI1_S_WAIT_SSN_CREDIT 0x0100 +#define HFI1_S_WAIT_DMA 0x0200 +#define HFI1_S_WAIT_PIO 0x0400 +#define HFI1_S_WAIT_TX 0x0800 +#define HFI1_S_WAIT_DMA_DESC 0x1000 +#define HFI1_S_WAIT_KMEM 0x2000 +#define HFI1_S_WAIT_PSN 0x4000 +#define HFI1_S_WAIT_ACK 0x8000 +#define HFI1_S_SEND_ONE 0x10000 +#define HFI1_S_UNLIMITED_CREDIT 0x20000 +#define HFI1_S_AHG_VALID 0x40000 +#define HFI1_S_AHG_CLEAR 0x80000 + +/* + * Wait flags that would prevent any packet type from being sent. + */ +#define HFI1_S_ANY_WAIT_IO (HFI1_S_WAIT_PIO | HFI1_S_WAIT_TX | \ + HFI1_S_WAIT_DMA_DESC | HFI1_S_WAIT_KMEM) + +/* + * Wait flags that would prevent send work requests from making progress. + */ +#define HFI1_S_ANY_WAIT_SEND (HFI1_S_WAIT_FENCE | HFI1_S_WAIT_RDMAR | \ + HFI1_S_WAIT_RNR | HFI1_S_WAIT_SSN_CREDIT | HFI1_S_WAIT_DMA | \ + HFI1_S_WAIT_PSN | HFI1_S_WAIT_ACK) + +#define HFI1_S_ANY_WAIT (HFI1_S_ANY_WAIT_IO | HFI1_S_ANY_WAIT_SEND) + +#define HFI1_PSN_CREDIT 16 + +/* + * Since struct hfi1_swqe is not a fixed size, we can't simply index into + * struct hfi1_qp.s_wq. This function does the array index computation. + */ +static inline struct hfi1_swqe *get_swqe_ptr(struct hfi1_qp *qp, + unsigned n) +{ + return (struct hfi1_swqe *)((char *)qp->s_wq + + (sizeof(struct hfi1_swqe) + + qp->s_max_sge * + sizeof(struct hfi1_sge)) * n); +} + +/* + * Since struct hfi1_rwqe is not a fixed size, we can't simply index into + * struct hfi1_rwq.wq. This function does the array index computation. 
+ */ +static inline struct hfi1_rwqe *get_rwqe_ptr(struct hfi1_rq *rq, unsigned n) +{ + return (struct hfi1_rwqe *) + ((char *) rq->wq->wq + + (sizeof(struct hfi1_rwqe) + + rq->max_sge * sizeof(struct ib_sge)) * n); +} + +struct hfi1_lkey_table { + spinlock_t lock; /* protect changes in this struct */ + u32 next; /* next unused index (speeds search) */ + u32 gen; /* generation count */ + u32 max; /* size of the table */ + struct hfi1_mregion __rcu **table; +}; + +struct hfi1_opcode_stats { + u64 n_packets; /* number of packets */ + u64 n_bytes; /* total number of bytes */ +}; + +struct hfi1_opcode_stats_perctx { + struct hfi1_opcode_stats stats[128]; +}; + +static inline void inc_opstats( + u32 tlen, + struct hfi1_opcode_stats *stats) +{ +#ifdef CONFIG_DEBUG_FS + stats->n_bytes += tlen; + stats->n_packets++; +#endif +} + +struct hfi1_ibport { + struct hfi1_qp __rcu *qp0; + struct hfi1_qp __rcu *qp1; + struct ib_mad_agent *send_agent; /* agent for SMI (traps) */ + struct hfi1_ah *sm_ah; + struct hfi1_ah *smi_ah; + struct rb_root mcast_tree; + spinlock_t lock; /* protect changes in this struct */ + + /* non-zero when timer is set */ + unsigned long mkey_lease_timeout; + unsigned long trap_timeout; + __be64 gid_prefix; /* in network order */ + __be64 mkey; + __be64 guids[HFI1_GUIDS_PER_PORT - 1]; /* writable GUIDs */ + u64 tid; /* TID for traps */ + u64 n_rc_resends; + u64 n_seq_naks; + u64 n_rdma_seq; + u64 n_rnr_naks; + u64 n_other_naks; + u64 n_loop_pkts; + u64 n_pkt_drops; + u64 n_vl15_dropped; + u64 n_rc_timeouts; + u64 n_dmawait; + u64 n_unaligned; + u64 n_rc_dupreq; + u64 n_rc_seqnak; + + /* Hot-path per CPU counters to avoid cacheline trading to update */ + u64 z_rc_acks; + u64 z_rc_qacks; + u64 z_rc_delayed_comp; + u64 __percpu *rc_acks; + u64 __percpu *rc_qacks; + u64 __percpu *rc_delayed_comp; + + u32 port_cap_flags; + u32 pma_sample_start; + u32 pma_sample_interval; + __be16 pma_counter_select[5]; + u16 pma_tag; + u16 pkey_violations; + u16 
qkey_violations; + u16 mkey_violations; + u16 mkey_lease_period; + u16 sm_lid; + u16 repress_traps; + u8 sm_sl; + u8 mkeyprot; + u8 subnet_timeout; + u8 vl_high_limit; + /* the first 16 entries are sl_to_vl for !OPA */ + u8 sl_to_sc[32]; + u8 sc_to_sl[32]; +}; + + +struct hfi1_qp_ibdev; +struct hfi1_ibdev { + struct ib_device ibdev; + struct list_head pending_mmaps; + spinlock_t mmap_offset_lock; /* protect mmap_offset */ + u32 mmap_offset; + struct hfi1_mregion __rcu *dma_mr; + + struct hfi1_qp_ibdev *qp_dev; + + /* QP numbers are shared by all IB ports */ + struct hfi1_lkey_table lk_table; + struct list_head txwait; /* list for wait verbs_txreq */ + struct list_head memwait; /* list for wait kernel memory */ + struct list_head txreq_free; + struct timer_list mem_timer; + struct tx_pio_header *pio_hdrs; + size_t pio_hdr_bytes; + dma_addr_t pio_hdrs_phys; + /* list of QPs waiting for RNR timer */ + spinlock_t pending_lock; /* protect wait lists, PMA counters, etc. */ + + u32 n_piowait; + u32 n_txwait; + + u32 n_pds_allocated; /* number of PDs allocated for device */ + spinlock_t n_pds_lock; + u32 n_ahs_allocated; /* number of AHs allocated for device */ + spinlock_t n_ahs_lock; + u32 n_cqs_allocated; /* number of CQs allocated for device */ + spinlock_t n_cqs_lock; + u32 n_qps_allocated; /* number of QPs allocated for device */ + spinlock_t n_qps_lock; + u32 n_srqs_allocated; /* number of SRQs allocated for device */ + spinlock_t n_srqs_lock; + u32 n_mcast_grps_allocated; /* number of mcast groups allocated */ + spinlock_t n_mcast_grps_lock; +#ifdef CONFIG_DEBUG_FS + /* per HFI debugfs */ + struct dentry *hfi1_ibdev_dbg; + /* per HFI symlinks to above */ + struct dentry *hfi1_ibdev_link; +#endif +}; + +struct hfi1_verbs_counters { + u64 symbol_error_counter; + u64 link_error_recovery_counter; + u64 link_downed_counter; + u64 port_rcv_errors; + u64 port_rcv_remphys_errors; + u64 port_xmit_discards; + u64 port_xmit_data; + u64 port_rcv_data; + u64 port_xmit_packets; 
+ u64 port_rcv_packets; + u32 local_link_integrity_errors; + u32 excessive_buffer_overrun_errors; + u32 vl15_dropped; +}; + +static inline struct hfi1_mr *to_imr(struct ib_mr *ibmr) +{ + return container_of(ibmr, struct hfi1_mr, ibmr); +} + +static inline struct hfi1_pd *to_ipd(struct ib_pd *ibpd) +{ + return container_of(ibpd, struct hfi1_pd, ibpd); +} + +static inline struct hfi1_ah *to_iah(struct ib_ah *ibah) +{ + return container_of(ibah, struct hfi1_ah, ibah); +} + +static inline struct hfi1_cq *to_icq(struct ib_cq *ibcq) +{ + return container_of(ibcq, struct hfi1_cq, ibcq); +} + +static inline struct hfi1_srq *to_isrq(struct ib_srq *ibsrq) +{ + return container_of(ibsrq, struct hfi1_srq, ibsrq); +} + +static inline struct hfi1_qp *to_iqp(struct ib_qp *ibqp) +{ + return container_of(ibqp, struct hfi1_qp, ibqp); +} + +static inline struct hfi1_ibdev *to_idev(struct ib_device *ibdev) +{ + return container_of(ibdev, struct hfi1_ibdev, ibdev); +} + +/* + * Send if not busy or waiting for I/O and either + * a RC response is pending or we can process send work requests. + */ +static inline int hfi1_send_ok(struct hfi1_qp *qp) +{ + return !(qp->s_flags & (HFI1_S_BUSY | HFI1_S_ANY_WAIT_IO)) && + (qp->s_hdrwords || (qp->s_flags & HFI1_S_RESP_PENDING) || + !(qp->s_flags & HFI1_S_ANY_WAIT_SEND)); +} + +/* + * This must be called with s_lock held. 
+ */ +void hfi1_schedule_send(struct hfi1_qp *qp); +void hfi1_bad_pqkey(struct hfi1_ibport *ibp, __be16 trap_num, u32 key, u32 sl, + u32 qp1, u32 qp2, __be16 lid1, __be16 lid2); +void hfi1_cap_mask_chg(struct hfi1_ibport *ibp); +void hfi1_sys_guid_chg(struct hfi1_ibport *ibp); +void hfi1_node_desc_chg(struct hfi1_ibport *ibp); +int hfi1_process_mad(struct ib_device *ibdev, int mad_flags, u8 port, + const struct ib_wc *in_wc, const struct ib_grh *in_grh, + const struct ib_mad_hdr *in_mad, size_t in_mad_size, + struct ib_mad_hdr *out_mad, size_t *out_mad_size, + u16 *out_mad_pkey_index); +int hfi1_create_agents(struct hfi1_ibdev *dev); +void hfi1_free_agents(struct hfi1_ibdev *dev); + +/* + * The PSN_MASK and PSN_SHIFT allow for + * 1) comparing two PSNs + * 2) returning the PSN with any upper bits masked + * 3) returning the difference between two PSNs + * + * The number of significant bits in the PSN must + * necessarily be at least one bit less than + * the container holding the PSN. + */ +#ifndef CONFIG_HFI1_VERBS_31BIT_PSN +#define PSN_MASK 0xFFFFFF +#define PSN_SHIFT 8 +#else +#define PSN_MASK 0x7FFFFFFF +#define PSN_SHIFT 1 +#endif +#define PSN_MODIFY_MASK 0xFFFFFF + +/* + * Compare the lower 24 bits of the msn values. + * Returns an integer less than, equal to, or greater than zero. + */ +static inline int cmp_msn(u32 a, u32 b) +{ + return (((int) a) - ((int) b)) << 8; +} + +/* + * Compare two PSNs + * Returns an integer less than, equal to, or greater than zero.
+ */ +static inline int cmp_psn(u32 a, u32 b) +{ + return (((int) a) - ((int) b)) << PSN_SHIFT; +} + +/* + * Return masked PSN + */ +static inline u32 mask_psn(u32 a) +{ + return a & PSN_MASK; +} + +/* + * Return delta between two PSNs + */ +static inline u32 delta_psn(u32 a, u32 b) +{ + return (((int)a - (int)b) << PSN_SHIFT) >> PSN_SHIFT; +} + +struct hfi1_mcast *hfi1_mcast_find(struct hfi1_ibport *ibp, union ib_gid *mgid); + +int hfi1_multicast_attach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid); + +int hfi1_multicast_detach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid); + +int hfi1_mcast_tree_empty(struct hfi1_ibport *ibp); + +struct verbs_txreq; +void hfi1_put_txreq(struct verbs_txreq *tx); + +int hfi1_verbs_send(struct hfi1_qp *qp, struct ahg_ib_header *ahdr, + u32 hdrwords, struct hfi1_sge_state *ss, u32 len); + +void hfi1_copy_sge(struct hfi1_sge_state *ss, void *data, u32 length, + int release); + +void hfi1_skip_sge(struct hfi1_sge_state *ss, u32 length, int release); + +void hfi1_uc_rcv(struct hfi1_ibport *ibp, struct hfi1_ib_header *hdr, + u32 rcv_flags, void *data, u32 tlen, struct hfi1_qp *qp); + +void hfi1_rc_rcv(struct hfi1_ctxtdata *rcd, struct hfi1_ib_header *hdr, + u32 rcv_flags, void *data, u32 tlen, struct hfi1_qp *qp); + +void hfi1_rc_hdrerr( + struct hfi1_ctxtdata *rcd, + struct hfi1_ib_header *hdr, + u32 rcv_flags, + struct hfi1_qp *qp); + +u8 ah_to_sc(struct ib_device *ibdev, struct ib_ah_attr *ah_attr); + +int hfi1_check_ah(struct ib_device *ibdev, struct ib_ah_attr *ah_attr); + +struct ib_ah *hfi1_create_qp0_ah(struct hfi1_ibport *ibp, u16 dlid); + +void hfi1_rc_rnr_retry(unsigned long arg); + +void hfi1_rc_send_complete(struct hfi1_qp *qp, struct hfi1_ib_header *hdr); + +void hfi1_rc_error(struct hfi1_qp *qp, enum ib_wc_status err); + +void hfi1_ud_rcv(struct hfi1_ibport *ibp, struct hfi1_ib_header *hdr, + u32 rcv_flags, void *data, u32 tlen, struct hfi1_qp *qp); + +int hfi1_lookup_pkey_idx(struct hfi1_ibport *ibp, u16 pkey); + +int 
hfi1_alloc_lkey(struct hfi1_mregion *mr, int dma_region); + +void hfi1_free_lkey(struct hfi1_mregion *mr); + +int hfi1_lkey_ok(struct hfi1_lkey_table *rkt, struct hfi1_pd *pd, + struct hfi1_sge *isge, struct ib_sge *sge, int acc); + +int hfi1_rkey_ok(struct hfi1_qp *qp, struct hfi1_sge *sge, + u32 len, u64 vaddr, u32 rkey, int acc); + +int hfi1_post_srq_receive(struct ib_srq *ibsrq, struct ib_recv_wr *wr, + struct ib_recv_wr **bad_wr); + +struct ib_srq *hfi1_create_srq(struct ib_pd *ibpd, + struct ib_srq_init_attr *srq_init_attr, + struct ib_udata *udata); + +int hfi1_modify_srq(struct ib_srq *ibsrq, struct ib_srq_attr *attr, + enum ib_srq_attr_mask attr_mask, + struct ib_udata *udata); + +int hfi1_query_srq(struct ib_srq *ibsrq, struct ib_srq_attr *attr); + +int hfi1_destroy_srq(struct ib_srq *ibsrq); + +int hfi1_cq_init(struct hfi1_devdata *dd); + +void hfi1_cq_exit(struct hfi1_devdata *dd); + +void hfi1_cq_enter(struct hfi1_cq *cq, struct ib_wc *entry, int sig); + +int hfi1_poll_cq(struct ib_cq *ibcq, int num_entries, struct ib_wc *entry); + +struct ib_cq *hfi1_create_cq( + struct ib_device *ibdev, + const struct ib_cq_init_attr *attr, + struct ib_ucontext *context, + struct ib_udata *udata); + +int hfi1_destroy_cq(struct ib_cq *ibcq); + +int hfi1_req_notify_cq( + struct ib_cq *ibcq, + enum ib_cq_notify_flags notify_flags); + +int hfi1_resize_cq(struct ib_cq *ibcq, int cqe, struct ib_udata *udata); + +struct ib_mr *hfi1_get_dma_mr(struct ib_pd *pd, int acc); + +struct ib_mr *hfi1_reg_phys_mr(struct ib_pd *pd, + struct ib_phys_buf *buffer_list, + int num_phys_buf, int acc, u64 *iova_start); + +struct ib_mr *hfi1_reg_user_mr(struct ib_pd *pd, u64 start, u64 length, + u64 virt_addr, int mr_access_flags, + struct ib_udata *udata); + +int hfi1_dereg_mr(struct ib_mr *ibmr); + +struct ib_mr *hfi1_alloc_fast_reg_mr(struct ib_pd *pd, int max_page_list_len); + +struct ib_fast_reg_page_list *hfi1_alloc_fast_reg_page_list( + struct ib_device *ibdev, int page_list_len); + 
+void hfi1_free_fast_reg_page_list(struct ib_fast_reg_page_list *pl); + +int hfi1_fast_reg_mr(struct hfi1_qp *qp, struct ib_send_wr *wr); + +struct ib_fmr *hfi1_alloc_fmr(struct ib_pd *pd, int mr_access_flags, + struct ib_fmr_attr *fmr_attr); + +int hfi1_map_phys_fmr(struct ib_fmr *ibfmr, u64 *page_list, + int list_len, u64 iova); + +int hfi1_unmap_fmr(struct list_head *fmr_list); + +int hfi1_dealloc_fmr(struct ib_fmr *ibfmr); + +static inline void hfi1_get_mr(struct hfi1_mregion *mr) +{ + atomic_inc(&mr->refcount); +} + +static inline void hfi1_put_mr(struct hfi1_mregion *mr) +{ + if (unlikely(atomic_dec_and_test(&mr->refcount))) + complete(&mr->comp); +} + +static inline void hfi1_put_ss(struct hfi1_sge_state *ss) +{ + while (ss->num_sge) { + hfi1_put_mr(ss->sge.mr); + if (--ss->num_sge) + ss->sge = *ss->sg_list++; + } +} + +void hfi1_release_mmap_info(struct kref *ref); + +struct hfi1_mmap_info *hfi1_create_mmap_info(struct hfi1_ibdev *dev, u32 size, + struct ib_ucontext *context, + void *obj); + +void hfi1_update_mmap_info(struct hfi1_ibdev *dev, struct hfi1_mmap_info *ip, + u32 size, void *obj); + +int hfi1_mmap(struct ib_ucontext *context, struct vm_area_struct *vma); + +int hfi1_get_rwqe(struct hfi1_qp *qp, int wr_id_only); + +void hfi1_migrate_qp(struct hfi1_qp *qp); + +int hfi1_ruc_check_hdr(struct hfi1_ibport *ibp, struct hfi1_ib_header *hdr, + int has_grh, struct hfi1_qp *qp, u32 bth0); + +u32 hfi1_make_grh(struct hfi1_ibport *ibp, struct ib_grh *hdr, + struct ib_global_route *grh, u32 hwords, u32 nwords); + +void clear_ahg(struct hfi1_qp *qp); + +void hfi1_make_ruc_header(struct hfi1_qp *qp, struct hfi1_other_headers *ohdr, + u32 bth0, u32 bth2, int middle); + +void hfi1_do_send(struct work_struct *work); + +void hfi1_send_complete(struct hfi1_qp *qp, struct hfi1_swqe *wqe, + enum ib_wc_status status); + +void hfi1_send_rc_ack(struct hfi1_ctxtdata *, struct hfi1_qp *qp, int is_fecn); + +int hfi1_make_rc_req(struct hfi1_qp *qp); + +int 
hfi1_make_uc_req(struct hfi1_qp *qp); + +int hfi1_make_ud_req(struct hfi1_qp *qp); + +int hfi1_register_ib_device(struct hfi1_devdata *); + +void hfi1_unregister_ib_device(struct hfi1_devdata *); + +void hfi1_ib_rcv(struct hfi1_packet *packet); + +unsigned hfi1_get_npkeys(struct hfi1_devdata *); + +unsigned hfi1_get_pkey(struct hfi1_ibport *, unsigned); + +int hfi1_verbs_send_dma(struct hfi1_qp *qp, struct ahg_ib_header *hdr, + u32 hdrwords, struct hfi1_sge_state *ss, u32 len, + u32 plen, u32 dwords, u64 pbc); + +int hfi1_verbs_send_pio(struct hfi1_qp *qp, struct ahg_ib_header *hdr, + u32 hdrwords, struct hfi1_sge_state *ss, u32 len, + u32 plen, u32 dwords, u64 pbc); + +struct send_context *qp_to_send_context(struct hfi1_qp *qp, u8 sc5); + +extern const enum ib_wc_opcode ib_hfi1_wc_opcode[]; + +extern const u8 hdr_len_by_opcode[]; + +extern const int ib_hfi1_state_ops[]; + +extern __be64 ib_hfi1_sys_image_guid; /* in network order */ + +extern unsigned int hfi1_lkey_table_size; + +extern unsigned int hfi1_max_cqes; + +extern unsigned int hfi1_max_cqs; + +extern unsigned int hfi1_max_qp_wrs; + +extern unsigned int hfi1_max_qps; + +extern unsigned int hfi1_max_sges; + +extern unsigned int hfi1_max_mcast_grps; + +extern unsigned int hfi1_max_mcast_qp_attached; + +extern unsigned int hfi1_max_srqs; + +extern unsigned int hfi1_max_srq_sges; + +extern unsigned int hfi1_max_srq_wrs; + +extern const u32 ib_hfi1_rnr_table[]; + +extern struct ib_dma_mapping_ops hfi1_dma_mapping_ops; + +#endif /* HFI1_VERBS_H */
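Note on the PSN helpers in verbs.h (cmp_psn(), mask_psn(), delta_psn()): they shift the difference of two PSNs left by PSN_SHIFT so that the PSN's top significant bit lands in the sign bit of a 32-bit word, which lets the sign of the result order two PSNs even across wrap-around. The stand-alone user-space sketch below illustrates that idea for the default 24-bit PSN_MASK/PSN_SHIFT values. It is not part of this patch: the names psn_mask, psn_cmp and psn_delta are hypothetical, psn_delta returns a signed value here, and the arithmetic is done on unsigned operands (unlike the kernel helpers, which rely on the kernel's compiler flags) so that the wrap-around is well defined outside the kernel build.

```c
#include <stdint.h>

/* Values copied from verbs.h for the default 24-bit PSN configuration. */
#define PSN_MASK  0xFFFFFFu
#define PSN_SHIFT 8

/* Keep only the significant PSN bits (hypothetical analogue of mask_psn()). */
static inline uint32_t psn_mask(uint32_t a)
{
	return a & PSN_MASK;
}

/*
 * Hypothetical analogue of cmp_psn(): negative, zero or positive when a is
 * before, equal to or after b in 24-bit sequence space. The subtraction and
 * shift are unsigned, so they wrap modulo 2^32; the final conversion to
 * int32_t assumes a two's complement target.
 */
static inline int32_t psn_cmp(uint32_t a, uint32_t b)
{
	return (int32_t)((a - b) << PSN_SHIFT);
}

/*
 * Hypothetical analogue of delta_psn(): signed distance from b to a.
 * psn_cmp() always returns a multiple of 2^PSN_SHIFT, so the division is
 * exact and avoids right-shifting a negative value.
 */
static inline int32_t psn_delta(uint32_t a, uint32_t b)
{
	return psn_cmp(a, b) / (1 << PSN_SHIFT);
}
```

With these definitions psn_cmp(0x000001, 0xFFFFFF) is positive and psn_delta(0x000001, 0xFFFFFF) is 2, so PSN 1 is treated as two packets after PSN 0xFFFFFF rather than as a much older packet.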