From patchwork Wed Mar 17 08:15:42 2021
X-Patchwork-Submitter: Selvin Xavier
X-Patchwork-Id: 12145075
X-Patchwork-Delegate: jgg@ziepe.ca
From: Selvin Xavier
To: 
jgg@ziepe.ca, dledford@redhat.com
Cc: linux-rdma@vger.kernel.org, Selvin Xavier, Naresh Kumar PBS, Devesh Sharma
Subject: [PATCH for-next v2] RDMA/bnxt_re: Move device to error state upon device crash
Date: Wed, 17 Mar 2021 01:15:42 -0700
Message-Id: <1615968942-30970-1-git-send-email-selvin.xavier@broadcom.com>
X-Mailer: git-send-email 2.5.5
X-Mailing-List: linux-rdma@vger.kernel.org

When the L2 driver detects a device crash or a device reset, it invokes
a stop callback to recover from the error. The current RoCE driver
doesn't recover the device, so move the device to the error state and
dispatch fatal events to all QPs.

Release the MSIx vectors to avoid a crash when the L2 driver disables
MSIx. Also, check the device state to avoid posting further commands
to the HW.

Signed-off-by: Naresh Kumar PBS
Signed-off-by: Devesh Sharma
Signed-off-by: Selvin Xavier
---
v1->v2:
  Fix the build warning
  Reported-by: kernel test robot

 drivers/infiniband/hw/bnxt_re/bnxt_re.h    |  1 +
 drivers/infiniband/hw/bnxt_re/main.c       | 40 ++++++++++++++++++++++++++++++
 drivers/infiniband/hw/bnxt_re/qplib_rcfw.c |  4 +++
 drivers/infiniband/hw/bnxt_re/qplib_rcfw.h |  2 ++
 4 files changed, 47 insertions(+)

diff --git a/drivers/infiniband/hw/bnxt_re/bnxt_re.h b/drivers/infiniband/hw/bnxt_re/bnxt_re.h
index b930ea3..ba26d8e 100644
--- a/drivers/infiniband/hw/bnxt_re/bnxt_re.h
+++ b/drivers/infiniband/hw/bnxt_re/bnxt_re.h
@@ -138,6 +138,7 @@ struct bnxt_re_dev {
 #define BNXT_RE_FLAG_QOS_WORK_REG		5
 #define BNXT_RE_FLAG_RESOURCES_ALLOCATED	7
 #define BNXT_RE_FLAG_RESOURCES_INITIALIZED	8
+#define BNXT_RE_FLAG_ERR_DEVICE_DETACHED	17
 #define BNXT_RE_FLAG_ISSUE_ROCE_STATS		29
 	struct net_device		*netdev;
 	unsigned int			version, major, minor;
diff --git a/drivers/infiniband/hw/bnxt_re/main.c b/drivers/infiniband/hw/bnxt_re/main.c
index fdb8c24..b30d37f 100644
--- a/drivers/infiniband/hw/bnxt_re/main.c
+++ b/drivers/infiniband/hw/bnxt_re/main.c
@@ -81,6 +81,7 @@ static struct workqueue_struct *bnxt_re_wq;
 static void bnxt_re_remove_device(struct bnxt_re_dev *rdev);
 static void bnxt_re_dealloc_driver(struct ib_device *ib_dev);
 static void bnxt_re_stop_irq(void *handle);
+static void bnxt_re_dev_stop(struct bnxt_re_dev *rdev);
 
 static void bnxt_re_set_drv_mode(struct bnxt_re_dev *rdev, u8 mode)
 {
@@ -221,6 +222,37 @@ static void bnxt_re_set_resource_limits(struct bnxt_re_dev *rdev)
 /* for handling bnxt_en callbacks later */
 static void bnxt_re_stop(void *p)
 {
+	struct bnxt_re_dev *rdev = p;
+	struct bnxt *bp;
+
+	if (!rdev)
+		return;
+	ASSERT_RTNL();
+
+	/* L2 driver invokes this callback during device error/crash or device
+	 * reset. Current RoCE driver doesn't recover the device in case of
+	 * error. Handle the error by dispatching fatal events to all qps
+	 * i.e. by calling bnxt_re_dev_stop and release the MSIx vectors as
+	 * L2 driver wants to modify the MSIx table.
+	 */
+	bp = netdev_priv(rdev->netdev);
+
+	ibdev_info(&rdev->ibdev, "Handle device stop call from L2 driver");
+	/* Check the current device state from L2 structure and move the
+	 * device to detached state if FW_FATAL_COND is set.
+	 * This prevents more commands to HW during clean-up,
+	 * in case the device is already in error.
+	 */
+	if (test_bit(BNXT_STATE_FW_FATAL_COND, &bp->state))
+		set_bit(ERR_DEVICE_DETACHED, &rdev->rcfw.cmdq.flags);
+
+	bnxt_re_dev_stop(rdev);
+	bnxt_re_stop_irq(rdev);
+	/* Move the device states to detached and avoid sending any more
+	 * commands to HW
+	 */
+	set_bit(BNXT_RE_FLAG_ERR_DEVICE_DETACHED, &rdev->flags);
+	set_bit(ERR_DEVICE_DETACHED, &rdev->rcfw.cmdq.flags);
 }
 
 static void bnxt_re_start(void *p)
@@ -234,6 +266,8 @@ static void bnxt_re_sriov_config(void *p, int num_vfs)

 	if (!rdev)
 		return;
+	if (test_bit(BNXT_RE_FLAG_ERR_DEVICE_DETACHED, &rdev->flags))
+		return;
 	rdev->num_vfs = num_vfs;
 	if (!bnxt_qplib_is_chip_gen_p5(rdev->chip_ctx)) {
 		bnxt_re_set_resource_limits(rdev);
@@ -427,6 +461,9 @@ static int bnxt_re_net_ring_free(struct bnxt_re_dev *rdev,
 	if (!en_dev)
 		return rc;
 
+	if (test_bit(BNXT_RE_FLAG_ERR_DEVICE_DETACHED, &rdev->flags))
+		return 0;
+
 	memset(&fw_msg, 0, sizeof(fw_msg));
 
 	bnxt_re_init_hwrm_hdr(rdev, (void *)&req, HWRM_RING_FREE, -1, -1);
@@ -489,6 +526,9 @@ static int bnxt_re_net_stats_ctx_free(struct bnxt_re_dev *rdev,
 	if (!en_dev)
 		return rc;
 
+	if (test_bit(BNXT_RE_FLAG_ERR_DEVICE_DETACHED, &rdev->flags))
+		return 0;
+
 	memset(&fw_msg, 0, sizeof(fw_msg));
 
 	bnxt_re_init_hwrm_hdr(rdev, (void *)&req, HWRM_STAT_CTX_FREE, -1, -1);
diff --git a/drivers/infiniband/hw/bnxt_re/qplib_rcfw.c b/drivers/infiniband/hw/bnxt_re/qplib_rcfw.c
index 441eb42..5d384de 100644
--- a/drivers/infiniband/hw/bnxt_re/qplib_rcfw.c
+++ b/drivers/infiniband/hw/bnxt_re/qplib_rcfw.c
@@ -212,6 +212,10 @@ int bnxt_qplib_rcfw_send_message(struct bnxt_qplib_rcfw *rcfw,
 	u8 opcode, retry_cnt = 0xFF;
 	int rc = 0;
 
+	/* Prevent posting if f/w is not in a state to process */
+	if (test_bit(ERR_DEVICE_DETACHED, &rcfw->cmdq.flags))
+		return 0;
+
 	do {
 		opcode = req->opcode;
 		rc = __send_message(rcfw, req, resp, sb, is_block);
diff --git a/drivers/infiniband/hw/bnxt_re/qplib_rcfw.h b/drivers/infiniband/hw/bnxt_re/qplib_rcfw.h
index 5f2f0a5..9474c00 100644
--- a/drivers/infiniband/hw/bnxt_re/qplib_rcfw.h
+++ b/drivers/infiniband/hw/bnxt_re/qplib_rcfw.h
@@ -138,6 +138,8 @@ struct bnxt_qplib_qp_node {
 #define FIRMWARE_INITIALIZED_FLAG	(0)
 #define FIRMWARE_FIRST_FLAG		(31)
 #define FIRMWARE_TIMED_OUT		(3)
+#define ERR_DEVICE_DETACHED		(4)
+
 struct bnxt_qplib_cmdq_mbox {
 	struct bnxt_qplib_reg_desc	reg;
 	void __iomem			*prod;
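
For readers who want to see the gating idea in isolation: the patch sets a "detached" bit in the command queue flags and makes the send path return early once that bit is set. Below is a rough userspace sketch of that pattern only, not the driver code. Every sketch_* name is hypothetical, C11 atomics stand in for the kernel's set_bit()/test_bit(), and the "posted" counter is an invented stand-in for commands reaching the HW.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical bit number, mirroring ERR_DEVICE_DETACHED (4) in the patch. */
#define SKETCH_ERR_DEVICE_DETACHED 4

struct sketch_cmdq {
	atomic_ulong flags;	/* stands in for rcfw->cmdq.flags */
	int posted;		/* invented: commands that reached the "HW" */
};

/* Userspace stand-ins for the kernel's set_bit()/test_bit(). */
static void sketch_set_bit(int nr, atomic_ulong *addr)
{
	atomic_fetch_or(addr, 1UL << nr);
}

static bool sketch_test_bit(int nr, atomic_ulong *addr)
{
	return (atomic_load(addr) >> nr) & 1UL;
}

/* Models the early return added to bnxt_qplib_rcfw_send_message(). */
static int sketch_send_message(struct sketch_cmdq *cmdq)
{
	if (sketch_test_bit(SKETCH_ERR_DEVICE_DETACHED, &cmdq->flags))
		return 0;	/* skipped, but still reported as success */
	cmdq->posted++;
	return 0;
}

/* Models the tail of bnxt_re_stop(): mark the command queue detached. */
static void sketch_stop(struct sketch_cmdq *cmdq)
{
	sketch_set_bit(SKETCH_ERR_DEVICE_DETACHED, &cmdq->flags);
}

/* Send once, stop, send again; returns how many commands were posted. */
static int sketch_demo(void)
{
	struct sketch_cmdq cmdq = { .posted = 0 };

	atomic_init(&cmdq.flags, 0);
	assert(sketch_send_message(&cmdq) == 0);	/* posted */
	sketch_stop(&cmdq);
	assert(sketch_send_message(&cmdq) == 0);	/* gated */
	return cmdq.posted;
}
```

Calling sketch_demo() posts the first command and skips the second one after the stop. As in the patch, a skipped command returns 0, so callers in the teardown path cannot tell a skipped command from a completed one; that looks like a deliberate choice so that clean-up proceeds without error handling for a dead device.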