diff mbox series

[for-next,3/9] RDMA/hns: Fix cpu stuck caused by printings during reset

Message ID 20240906093444.3571619-4-huangjunxian6@hisilicon.com (mailing list archive)
State Changes Requested
Headers show
Series RDMA/hns: Bugfixes and one improvement | expand

Commit Message

Junxian Huang Sept. 6, 2024, 9:34 a.m. UTC
From: wenglianfa <wenglianfa@huawei.com>

During reset, cmd to destroy resources such as qp, cq, and mr may
fail, and error logs will be printed. When a large number of
resources are destroyed, there will be lots of printings, and it
may lead to a cpu stuck. Replace the printing functions in these
paths with the ratelimited version.

Fixes: 9a4435375cd1 ("IB/hns: Add driver files for hns RoCE driver")
Fixes: c7bcb13442e1 ("RDMA/hns: Add SRQ support for hip08 kernel mode")
Fixes: 70f92521584f ("RDMA/hns: Use the reserved loopback QPs to free MR before destroying MPT")
Fixes: 926a01dc000d ("RDMA/hns: Add QP operations support for hip08 SoC")
Signed-off-by: wenglianfa <wenglianfa@huawei.com>
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
---
 drivers/infiniband/hw/hns/hns_roce_cq.c    |  4 +-
 drivers/infiniband/hw/hns/hns_roce_hem.c   |  4 +-
 drivers/infiniband/hw/hns/hns_roce_hw_v2.c | 69 +++++++++++-----------
 drivers/infiniband/hw/hns/hns_roce_mr.c    |  4 +-
 drivers/infiniband/hw/hns/hns_roce_srq.c   |  4 +-
 5 files changed, 44 insertions(+), 41 deletions(-)

Comments

Leon Romanovsky Sept. 10, 2024, 1:09 p.m. UTC | #1
On Fri, Sep 06, 2024 at 05:34:38PM +0800, Junxian Huang wrote:
> From: wenglianfa <wenglianfa@huawei.com>
> 
> During reset, cmd to destroy resources such as qp, cq, and mr may
> fail, and error logs will be printed. When a large number of
> resources are destroyed, there will be lots of printings, and it
> may lead to a cpu stuck. Replace the printing functions in these
> paths with the ratelimited version.

At lease some of them if not most should be deleted.

Thanks
Junxian Huang Sept. 11, 2024, 1:34 a.m. UTC | #2
On 2024/9/10 21:09, Leon Romanovsky wrote:
> On Fri, Sep 06, 2024 at 05:34:38PM +0800, Junxian Huang wrote:
>> From: wenglianfa <wenglianfa@huawei.com>
>>
>> During reset, cmd to destroy resources such as qp, cq, and mr may
>> fail, and error logs will be printed. When a large number of
>> resources are destroyed, there will be lots of printings, and it
>> may lead to a cpu stuck. Replace the printing functions in these
>> paths with the ratelimited version.
> 
> At lease some of them if not most should be deleted.
> 

Hi Leon,I wonder if there is a clear standard about whether printing
can be added?

Thanks,
Junxian

> Thanks
Leon Romanovsky Sept. 11, 2024, 1:25 p.m. UTC | #3
On Wed, Sep 11, 2024 at 09:34:19AM +0800, Junxian Huang wrote:
> 
> 
> On 2024/9/10 21:09, Leon Romanovsky wrote:
> > On Fri, Sep 06, 2024 at 05:34:38PM +0800, Junxian Huang wrote:
> >> From: wenglianfa <wenglianfa@huawei.com>
> >>
> >> During reset, cmd to destroy resources such as qp, cq, and mr may
> >> fail, and error logs will be printed. When a large number of
> >> resources are destroyed, there will be lots of printings, and it
> >> may lead to a cpu stuck. Replace the printing functions in these
> >> paths with the ratelimited version.
> > 
> > At lease some of them if not most should be deleted.
> > 
> 
> Hi Leon,I wonder if there is a clear standard about whether printing
> can be added?

I don't think so, but there are some guidelines that can help you to do it:
1. Don't print error messages in the fast path.
2. Don't print error messages if other function down in the stack already
   printed it.
3. Don't print error messages if it is possible to trigger them from
unprivileged user.
...

> 
> Thanks,
> Junxian
> 
> > Thanks
Junxian Huang Sept. 12, 2024, 1:04 a.m. UTC | #4
On 2024/9/11 21:25, Leon Romanovsky wrote:
> On Wed, Sep 11, 2024 at 09:34:19AM +0800, Junxian Huang wrote:
>>
>>
>> On 2024/9/10 21:09, Leon Romanovsky wrote:
>>> On Fri, Sep 06, 2024 at 05:34:38PM +0800, Junxian Huang wrote:
>>>> From: wenglianfa <wenglianfa@huawei.com>
>>>>
>>>> During reset, cmd to destroy resources such as qp, cq, and mr may
>>>> fail, and error logs will be printed. When a large number of
>>>> resources are destroyed, there will be lots of printings, and it
>>>> may lead to a cpu stuck. Replace the printing functions in these
>>>> paths with the ratelimited version.
>>>
>>> At lease some of them if not most should be deleted.
>>>
>>
>> Hi Leon,I wonder if there is a clear standard about whether printing
>> can be added?
> 
> I don't think so, but there are some guidelines that can help you to do it:
> 1. Don't print error messages in the fast path.
> 2. Don't print error messages if other function down in the stack already
>    printed it.
> 3. Don't print error messages if it is possible to trigger them from
> unprivileged user.
> ...
> 

Thanks

>>
>> Thanks,
>> Junxian
>>
>>> Thanks
diff mbox series

Patch

diff --git a/drivers/infiniband/hw/hns/hns_roce_cq.c b/drivers/infiniband/hw/hns/hns_roce_cq.c
index 4ec66611a143..4106423a1b39 100644
--- a/drivers/infiniband/hw/hns/hns_roce_cq.c
+++ b/drivers/infiniband/hw/hns/hns_roce_cq.c
@@ -179,8 +179,8 @@  static void free_cqc(struct hns_roce_dev *hr_dev, struct hns_roce_cq *hr_cq)
 	ret = hns_roce_destroy_hw_ctx(hr_dev, HNS_ROCE_CMD_DESTROY_CQC,
 				      hr_cq->cqn);
 	if (ret)
-		dev_err(dev, "DESTROY_CQ failed (%d) for CQN %06lx\n", ret,
-			hr_cq->cqn);
+		dev_err_ratelimited(dev, "DESTROY_CQ failed (%d) for CQN %06lx\n",
+				    ret, hr_cq->cqn);
 
 	xa_erase_irq(&cq_table->array, hr_cq->cqn);
 
diff --git a/drivers/infiniband/hw/hns/hns_roce_hem.c b/drivers/infiniband/hw/hns/hns_roce_hem.c
index 02baa853a76c..496584139240 100644
--- a/drivers/infiniband/hw/hns/hns_roce_hem.c
+++ b/drivers/infiniband/hw/hns/hns_roce_hem.c
@@ -672,8 +672,8 @@  void hns_roce_table_put(struct hns_roce_dev *hr_dev,
 
 	ret = hr_dev->hw->clear_hem(hr_dev, table, obj, HEM_HOP_STEP_DIRECT);
 	if (ret)
-		dev_warn(dev, "failed to clear HEM base address, ret = %d.\n",
-			 ret);
+		dev_warn_ratelimited(dev, "failed to clear HEM base address, ret = %d.\n",
+				     ret);
 
 	hns_roce_free_hem(hr_dev, table->hem[i]);
 	table->hem[i] = NULL;
diff --git a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
index 2225c9cc6366..adcadd2495ab 100644
--- a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
+++ b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
@@ -378,12 +378,12 @@  static int check_send_valid(struct hns_roce_dev *hr_dev,
 	if (unlikely(hr_qp->state == IB_QPS_RESET ||
 		     hr_qp->state == IB_QPS_INIT ||
 		     hr_qp->state == IB_QPS_RTR)) {
-		ibdev_err(ibdev, "failed to post WQE, QP state %u!\n",
-			  hr_qp->state);
+		ibdev_err_ratelimited(ibdev, "failed to post WQE, QP state %u!\n",
+				      hr_qp->state);
 		return -EINVAL;
 	} else if (unlikely(hr_dev->state >= HNS_ROCE_DEVICE_STATE_RST_DOWN)) {
-		ibdev_err(ibdev, "failed to post WQE, dev state %d!\n",
-			  hr_dev->state);
+		ibdev_err_ratelimited(ibdev, "failed to post WQE, dev state %d!\n",
+				      hr_dev->state);
 		return -EIO;
 	}
 
@@ -2775,8 +2775,8 @@  static int free_mr_modify_rsv_qp(struct hns_roce_dev *hr_dev,
 	ret = hr_dev->hw->modify_qp(&hr_qp->ibqp, attr, mask, IB_QPS_INIT,
 				    IB_QPS_INIT, NULL);
 	if (ret) {
-		ibdev_err(ibdev, "failed to modify qp to init, ret = %d.\n",
-			  ret);
+		ibdev_err_ratelimited(ibdev, "failed to modify qp to init, ret = %d.\n",
+				      ret);
 		return ret;
 	}
 
@@ -3421,8 +3421,8 @@  static int free_mr_post_send_lp_wqe(struct hns_roce_qp *hr_qp)
 
 	ret = hns_roce_v2_post_send(&hr_qp->ibqp, send_wr, &bad_wr);
 	if (ret) {
-		ibdev_err(ibdev, "failed to post wqe for free mr, ret = %d.\n",
-			  ret);
+		ibdev_err_ratelimited(ibdev, "failed to post wqe for free mr, ret = %d.\n",
+				      ret);
 		return ret;
 	}
 
@@ -3461,9 +3461,9 @@  static void free_mr_send_cmd_to_hw(struct hns_roce_dev *hr_dev)
 
 		ret = free_mr_post_send_lp_wqe(hr_qp);
 		if (ret) {
-			ibdev_err(ibdev,
-				  "failed to send wqe (qp:0x%lx) for free mr, ret = %d.\n",
-				  hr_qp->qpn, ret);
+			ibdev_err_ratelimited(ibdev,
+					      "failed to send wqe (qp:0x%lx) for free mr, ret = %d.\n",
+					      hr_qp->qpn, ret);
 			break;
 		}
 
@@ -3474,16 +3474,16 @@  static void free_mr_send_cmd_to_hw(struct hns_roce_dev *hr_dev)
 	while (cqe_cnt) {
 		npolled = hns_roce_v2_poll_cq(&free_mr->rsv_cq->ib_cq, cqe_cnt, wc);
 		if (npolled < 0) {
-			ibdev_err(ibdev,
-				  "failed to poll cqe for free mr, remain %d cqe.\n",
-				  cqe_cnt);
+			ibdev_err_ratelimited(ibdev,
+					      "failed to poll cqe for free mr, remain %d cqe.\n",
+					      cqe_cnt);
 			goto out;
 		}
 
 		if (time_after(jiffies, end)) {
-			ibdev_err(ibdev,
-				  "failed to poll cqe for free mr and timeout, remain %d cqe.\n",
-				  cqe_cnt);
+			ibdev_err_ratelimited(ibdev,
+					      "failed to poll cqe for free mr and timeout, remain %d cqe.\n",
+					      cqe_cnt);
 			goto out;
 		}
 		cqe_cnt -= npolled;
@@ -5062,7 +5062,8 @@  static int hns_roce_v2_set_abs_fields(struct ib_qp *ibqp,
 	int ret = 0;
 
 	if (!check_qp_state(cur_state, new_state)) {
-		ibdev_err(&hr_dev->ib_dev, "Illegal state for QP!\n");
+		ibdev_err_ratelimited(&hr_dev->ib_dev,
+				      "Illegal state for QP!\n");
 		return -EINVAL;
 	}
 
@@ -5325,7 +5326,7 @@  static int hns_roce_v2_modify_qp(struct ib_qp *ibqp,
 	/* SW pass context to HW */
 	ret = hns_roce_v2_qp_modify(hr_dev, context, qpc_mask, hr_qp);
 	if (ret) {
-		ibdev_err(ibdev, "failed to modify QP, ret = %d.\n", ret);
+		ibdev_err_ratelimited(ibdev, "failed to modify QP, ret = %d.\n", ret);
 		goto out;
 	}
 
@@ -5463,7 +5464,9 @@  static int hns_roce_v2_query_qp(struct ib_qp *ibqp, struct ib_qp_attr *qp_attr,
 
 	ret = hns_roce_v2_query_qpc(hr_dev, hr_qp->qpn, &context);
 	if (ret) {
-		ibdev_err(ibdev, "failed to query QPC, ret = %d.\n", ret);
+		ibdev_err_ratelimited(ibdev,
+				      "failed to query QPC, ret = %d.\n",
+				      ret);
 		ret = -EINVAL;
 		goto out;
 	}
@@ -5471,7 +5474,7 @@  static int hns_roce_v2_query_qp(struct ib_qp *ibqp, struct ib_qp_attr *qp_attr,
 	state = hr_reg_read(&context, QPC_QP_ST);
 	tmp_qp_state = to_ib_qp_st((enum hns_roce_v2_qp_state)state);
 	if (tmp_qp_state == -1) {
-		ibdev_err(ibdev, "Illegal ib_qp_state\n");
+		ibdev_err_ratelimited(ibdev, "Illegal ib_qp_state\n");
 		ret = -EINVAL;
 		goto out;
 	}
@@ -5564,9 +5567,9 @@  static int hns_roce_v2_destroy_qp_common(struct hns_roce_dev *hr_dev,
 		ret = hns_roce_v2_modify_qp(&hr_qp->ibqp, NULL, 0,
 					    hr_qp->state, IB_QPS_RESET, udata);
 		if (ret)
-			ibdev_err(ibdev,
-				  "failed to modify QP to RST, ret = %d.\n",
-				  ret);
+			ibdev_err_ratelimited(ibdev,
+					      "failed to modify QP to RST, ret = %d.\n",
+					      ret);
 	}
 
 	send_cq = hr_qp->ibqp.send_cq ? to_hr_cq(hr_qp->ibqp.send_cq) : NULL;
@@ -5602,9 +5605,9 @@  int hns_roce_v2_destroy_qp(struct ib_qp *ibqp, struct ib_udata *udata)
 
 	ret = hns_roce_v2_destroy_qp_common(hr_dev, hr_qp, udata);
 	if (ret)
-		ibdev_err(&hr_dev->ib_dev,
-			  "failed to destroy QP, QPN = 0x%06lx, ret = %d.\n",
-			  hr_qp->qpn, ret);
+		ibdev_err_ratelimited(&hr_dev->ib_dev,
+				      "failed to destroy QP, QPN = 0x%06lx, ret = %d.\n",
+				      hr_qp->qpn, ret);
 
 	hns_roce_qp_destroy(hr_dev, hr_qp, udata);
 
@@ -5898,9 +5901,9 @@  static int hns_roce_v2_modify_cq(struct ib_cq *cq, u16 cq_count, u16 cq_period)
 				HNS_ROCE_CMD_MODIFY_CQC, hr_cq->cqn);
 	hns_roce_free_cmd_mailbox(hr_dev, mailbox);
 	if (ret)
-		ibdev_err(&hr_dev->ib_dev,
-			  "failed to process cmd when modifying CQ, ret = %d.\n",
-			  ret);
+		ibdev_err_ratelimited(&hr_dev->ib_dev,
+				      "failed to process cmd when modifying CQ, ret = %d.\n",
+				      ret);
 
 err_out:
 	if (ret)
@@ -5924,9 +5927,9 @@  static int hns_roce_v2_query_cqc(struct hns_roce_dev *hr_dev, u32 cqn,
 	ret = hns_roce_cmd_mbox(hr_dev, 0, mailbox->dma,
 				HNS_ROCE_CMD_QUERY_CQC, cqn);
 	if (ret) {
-		ibdev_err(&hr_dev->ib_dev,
-			  "failed to process cmd when querying CQ, ret = %d.\n",
-			  ret);
+		ibdev_err_ratelimited(&hr_dev->ib_dev,
+				      "failed to process cmd when querying CQ, ret = %d.\n",
+				      ret);
 		goto err_mailbox;
 	}
 
diff --git a/drivers/infiniband/hw/hns/hns_roce_mr.c b/drivers/infiniband/hw/hns/hns_roce_mr.c
index 846da8c78b8b..b3f4327d0e64 100644
--- a/drivers/infiniband/hw/hns/hns_roce_mr.c
+++ b/drivers/infiniband/hw/hns/hns_roce_mr.c
@@ -138,8 +138,8 @@  static void hns_roce_mr_free(struct hns_roce_dev *hr_dev, struct hns_roce_mr *mr
 					      key_to_hw_index(mr->key) &
 					      (hr_dev->caps.num_mtpts - 1));
 		if (ret)
-			ibdev_warn(ibdev, "failed to destroy mpt, ret = %d.\n",
-				   ret);
+			ibdev_warn_ratelimited(ibdev, "failed to destroy mpt, ret = %d.\n",
+					       ret);
 	}
 
 	free_mr_pbl(hr_dev, mr);
diff --git a/drivers/infiniband/hw/hns/hns_roce_srq.c b/drivers/infiniband/hw/hns/hns_roce_srq.c
index c9b8233f4b05..70c06ef65603 100644
--- a/drivers/infiniband/hw/hns/hns_roce_srq.c
+++ b/drivers/infiniband/hw/hns/hns_roce_srq.c
@@ -151,8 +151,8 @@  static void free_srqc(struct hns_roce_dev *hr_dev, struct hns_roce_srq *srq)
 	ret = hns_roce_destroy_hw_ctx(hr_dev, HNS_ROCE_CMD_DESTROY_SRQ,
 				      srq->srqn);
 	if (ret)
-		dev_err(hr_dev->dev, "DESTROY_SRQ failed (%d) for SRQN %06lx\n",
-			ret, srq->srqn);
+		dev_err_ratelimited(hr_dev->dev, "DESTROY_SRQ failed (%d) for SRQN %06lx\n",
+				    ret, srq->srqn);
 
 	xa_erase_irq(&srq_table->xa, srq->srqn);