From patchwork Mon Apr 16 01:02:07 2018
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: Zhu Yanjun
X-Patchwork-Id: 10341965
From: Zhu Yanjun
To: tariqt@mellanox.com, netdev@vger.kernel.org, linux-rdma@vger.kernel.org, haakon.bugge@oracle.com
Subject: [PATCH 1/1] net/mlx4_core: avoid resetting HCA when accessing an offline device
Date: Sun, 15 Apr 2018 21:02:07 -0400
Message-Id: <1523840527-22746-1-git-send-email-yanjun.zhu@oracle.com>
X-Mailer: git-send-email 2.7.4
Sender: linux-rdma-owner@vger.kernel.org
Precedence: bulk
X-Mailing-List: linux-rdma@vger.kernel.org

When a faulty cable is used or the HCA firmware hits an error, the HCA
device goes offline. If the driver then accesses this offline device,
the following call trace appears.

"
...
[] dump_stack+0x63/0x81
[] panic+0xcc/0x21b
[] mlx4_enter_error_state+0xba/0xf0 [mlx4_core]
[] mlx4_cmd_reset_flow+0x38/0x60 [mlx4_core]
[] mlx4_cmd_poll+0xc1/0x2e0 [mlx4_core]
[] __mlx4_cmd+0xb0/0x160 [mlx4_core]
[] mlx4_SENSE_PORT+0x54/0xd0 [mlx4_core]
[] mlx4_dev_cap+0x4a4/0xb50 [mlx4_core]
...
"

In the above call trace, mlx4_cmd_poll calls mlx4_cmd_post to access
the HCA while the HCA is offline, and mlx4_cmd_post returns -EIO.
On -EIO, mlx4_cmd_poll calls mlx4_cmd_reset_flow to reset the HCA,
which produces the call trace above.

This is not reasonable. Since the HCA is already offline when it is
accessed, it should not be reset again.

With this patch, mlx4_cmd_post returns -EINVAL when the HCA is offline.
On -EINVAL, mlx4_cmd_poll and mlx4_cmd_wait return directly instead of
resetting the HCA.

CC: Srinivas Eeda
CC: Junxiao Bi
Suggested-by: Håkon Bugge
Signed-off-by: Zhu Yanjun
---
 drivers/net/ethernet/mellanox/mlx4/cmd.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx4/cmd.c b/drivers/net/ethernet/mellanox/mlx4/cmd.c
index 6a9086d..f1c8c42 100644
--- a/drivers/net/ethernet/mellanox/mlx4/cmd.c
+++ b/drivers/net/ethernet/mellanox/mlx4/cmd.c
@@ -451,6 +451,8 @@ static int mlx4_cmd_post(struct mlx4_dev *dev, u64 in_param, u64 out_param,
 		 * Device is going through error recovery
 		 * and cannot accept commands.
 		 */
+		mlx4_err(dev, "%s : Device is in error recovery.\n", __func__);
+		ret = -EINVAL;
 		goto out;
 	}
 
@@ -657,6 +659,9 @@ static int mlx4_cmd_poll(struct mlx4_dev *dev, u64 in_param, u64 *out_param,
 	}
 
 out_reset:
+	if (err == -EINVAL)
+		goto out;
+
 	if (err)
 		err = mlx4_cmd_reset_flow(dev, op, op_modifier, err);
 out:
@@ -766,6 +771,9 @@ static int mlx4_cmd_wait(struct mlx4_dev *dev, u64 in_param, u64 *out_param,
 		*out_param = context->out_param;
 
 out_reset:
+	if (err == -EINVAL)
+		goto out;
+
 	if (err)
 		err = mlx4_cmd_reset_flow(dev, op, op_modifier, err);
 out:
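
For illustration only, the control flow described in the commit message can be modelled in userspace roughly as below. This is a sketch, not driver code: struct model_dev, model_cmd_post, model_cmd_reset_flow and model_cmd_poll are hypothetical stand-ins, and the reset flow is reduced to a counter on the assumption that only "was a reset attempted" matters here.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct mlx4_dev; not the driver's real layout. */
struct model_dev {
	bool offline;     /* models "device is in error recovery" */
	int reset_count;  /* number of times the reset flow was entered */
};

/* Models mlx4_cmd_post with the patch: an offline device yields -EINVAL. */
static int model_cmd_post(struct model_dev *dev)
{
	if (dev->offline)
		return -EINVAL;
	return 0;
}

/* Simplified reset flow: only records that a reset was attempted. */
static int model_cmd_reset_flow(struct model_dev *dev, int err)
{
	assert(err != 0); /* the reset flow is only entered on an error */
	dev->reset_count++;
	return err;
}

/*
 * Models the tail of mlx4_cmd_poll with the patch. cmd_fails simulates
 * a command failing on a device that is online (assumed to be -EIO).
 */
static int model_cmd_poll(struct model_dev *dev, bool cmd_fails)
{
	int err;

	err = model_cmd_post(dev);
	if (err)
		goto out_reset;

	if (cmd_fails)
		err = -EIO;

out_reset:
	if (err == -EINVAL)
		goto out; /* device already offline: skip the reset */

	if (err)
		err = model_cmd_reset_flow(dev, err);
out:
	return err;
}
```

In this model, polling a device with offline set returns -EINVAL and leaves reset_count at 0, while a simulated failure on an online device returns -EIO and enters the reset flow once.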