[rdma-next,4/4] nvme-rdma: add more error details when a QP moves to an error state

Message ID: 20220907113800.22182-5-phaddad@nvidia.com
State: Changes Requested
Series: Provide more error details when a QP moves to an error state

Commit Message

Patrisious Haddad Sept. 7, 2022, 11:38 a.m. UTC
From: Israel Rukshin <israelr@nvidia.com>

Add debug prints for fatal QP events that are helpful for finding the
root cause of the errors. ib_get_qp_err_syndrome() is called from a
work queue since the QP event callback runs in interrupt context and
can't sleep.

Signed-off-by: Israel Rukshin <israelr@nvidia.com>
Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
---
 drivers/nvme/host/rdma.c | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

Comments

Christoph Hellwig Sept. 7, 2022, 12:02 p.m. UTC | #1
On Wed, Sep 07, 2022 at 02:38:00PM +0300, Patrisious Haddad wrote:
> From: Israel Rukshin <israelr@nvidia.com>
> 
> Add debug prints for fatal QP events that are helpful for finding the
> root cause of the errors. The ib_get_qp_err_syndrome is called at
> a work queue since the QP event callback is running on an
> interrupt context that can't sleep.

What an awkward interface.  What prevents us from allowing 
ib_get_qp_err_syndrome to be called from arbitrary calling contexts,
or even better just delivering the error directly as part of the
event?
Leon Romanovsky Sept. 7, 2022, 12:11 p.m. UTC | #2
On Wed, Sep 07, 2022 at 02:02:00PM +0200, Christoph Hellwig wrote:
> On Wed, Sep 07, 2022 at 02:38:00PM +0300, Patrisious Haddad wrote:
> > From: Israel Rukshin <israelr@nvidia.com>
> > 
> > Add debug prints for fatal QP events that are helpful for finding the
> > root cause of the errors. The ib_get_qp_err_syndrome is called at
> > a work queue since the QP event callback is running on an
> > interrupt context that can't sleep.
> 
> What an awkward interface.  What prevents us from allowing 
> ib_get_qp_err_syndrome to be called from arbitrary calling contexts,
> or even better just delivering the error directly as part of the
> event?

We need to call into our FW through the command interface, and
unfortunately that is not possible in atomic context.

Thanks
Sagi Grimberg Sept. 7, 2022, 12:34 p.m. UTC | #3
> From: Israel Rukshin <israelr@nvidia.com>
> 
> Add debug prints for fatal QP events that are helpful for finding the
> root cause of the errors. The ib_get_qp_err_syndrome is called at
> a work queue since the QP event callback is running on an
> interrupt context that can't sleep.
> 
> Signed-off-by: Israel Rukshin <israelr@nvidia.com>
> Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
> Reviewed-by: Leon Romanovsky <leonro@nvidia.com>

What makes nvme-rdma special here? Why do you get this in
nvme-rdma and not srp/iser/nfs-rdma/rds/smc/ipoib etc?

This entire code needs to move to the rdma core instead
of being leaked to ulps.
Leon Romanovsky Sept. 7, 2022, 12:51 p.m. UTC | #4
On Wed, Sep 07, 2022 at 03:34:21PM +0300, Sagi Grimberg wrote:
> 
> > From: Israel Rukshin <israelr@nvidia.com>
> > 
> > Add debug prints for fatal QP events that are helpful for finding the
> > root cause of the errors. The ib_get_qp_err_syndrome is called at
> > a work queue since the QP event callback is running on an
> > interrupt context that can't sleep.
> > 
> > Signed-off-by: Israel Rukshin <israelr@nvidia.com>
> > Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
> > Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
> 
> What makes nvme-rdma special here? Why do you get this in
> nvme-rdma and not srp/iser/nfs-rdma/rds/smc/ipoib etc?
> 
> This entire code needs to move to the rdma core instead
> of being leaked to ulps.

We can move it, but you will lose the connection between the queue
number, the caller and the error itself.

As I answered Christoph, we will need to execute the query QP command
in a workqueue outside of the event handler.

So you will get a print about the queue being in error state, and later
you will see the parsed error print somewhere in dmesg.

Thanks
Sagi Grimberg Sept. 7, 2022, 3:16 p.m. UTC | #5
>>> From: Israel Rukshin <israelr@nvidia.com>
>>>
>>> Add debug prints for fatal QP events that are helpful for finding the
>>> root cause of the errors. The ib_get_qp_err_syndrome is called at
>>> a work queue since the QP event callback is running on an
>>> interrupt context that can't sleep.
>>>
>>> Signed-off-by: Israel Rukshin <israelr@nvidia.com>
>>> Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
>>> Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
>>
>> What makes nvme-rdma special here? Why do you get this in
>> nvme-rdma and not srp/iser/nfs-rdma/rds/smc/ipoib etc?
>>
>> This entire code needs to move to the rdma core instead
>> of being leaked to ulps.
> 
> We can move, but you will lose connection between queue number,
> caller and error itself.

That still doesn't explain why nvme-rdma is special.

In any event, the ulp can log the qpn so the context can be interrogated
if that is important.
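
A rough sketch of what that could look like if the syndrome query from this
series moved into ib_core: the qpn alone lets the ULP correlate the error
with its queue. The qp_err_work field is hypothetical and the print is only
illustrative.

static void ib_qp_err_syndrome_work(struct work_struct *work)
{
	/* 'qp_err_work' is an assumed per-QP work item, not an existing field. */
	struct ib_qp *qp = container_of(work, struct ib_qp, qp_err_work);
	char err[IB_ERR_SYNDROME_LENGTH];

	/* ib_get_qp_err_syndrome() sleeps, so it must run from a workqueue. */
	if (!ib_get_qp_err_syndrome(qp, err))
		pr_err("QP %u moved to an error state, syndrome: %s\n",
		       qp->qp_num, err);
}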
Christoph Hellwig Sept. 7, 2022, 3:18 p.m. UTC | #6
On Wed, Sep 07, 2022 at 06:16:05PM +0300, Sagi Grimberg wrote:
>>>
>>> This entire code needs to move to the rdma core instead
>>> of being leaked to ulps.
>>
>> We can move, but you will lose connection between queue number,
>> caller and error itself.
>
> That still doesn't explain why nvme-rdma is special.
>
> In any event, the ulp can log the qpn so the context can be interrogated
> if that is important.

I also don't see why the QP event handler can't be called
from user context to start with.  I see absolutely no reason to
add boilerplate code to drivers for reporting slightly more verbose
errors on one specific piece of hardware.  I'd say clean up the mess
that is the QP event handler first, and then once error reporting
becomes trivial we can just do it.
Leon Romanovsky Sept. 7, 2022, 5:29 p.m. UTC | #7
On Wed, Sep 07, 2022 at 06:16:05PM +0300, Sagi Grimberg wrote:
> 
> > > > From: Israel Rukshin <israelr@nvidia.com>
> > > > 
> > > > Add debug prints for fatal QP events that are helpful for finding the
> > > > root cause of the errors. The ib_get_qp_err_syndrome is called at
> > > > a work queue since the QP event callback is running on an
> > > > interrupt context that can't sleep.
> > > > 
> > > > Signed-off-by: Israel Rukshin <israelr@nvidia.com>
> > > > Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
> > > > Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
> > > 
> > > What makes nvme-rdma special here? Why do you get this in
> > > nvme-rdma and not srp/iser/nfs-rdma/rds/smc/ipoib etc?
> > > 
> > > This entire code needs to move to the rdma core instead
> > > of being leaked to ulps.
> > 
> > We can move, but you will lose connection between queue number,
> > caller and error itself.
> 
> That still doesn't explain why nvme-rdma is special.

It was important for us to get a proper review from at least one ULP;
nvme-rdma is not special at all.

> 
> In any event, the ulp can log the qpn so the context can be interrogated
> if that is important.

ok
Leon Romanovsky Sept. 7, 2022, 5:39 p.m. UTC | #8
On Wed, Sep 07, 2022 at 05:18:18PM +0200, Christoph Hellwig wrote:
> On Wed, Sep 07, 2022 at 06:16:05PM +0300, Sagi Grimberg wrote:
> >>>
> >>> This entire code needs to move to the rdma core instead
> >>> of being leaked to ulps.
> >>
> >> We can move, but you will lose connection between queue number,
> >> caller and error itself.
> >
> > That still doesn't explain why nvme-rdma is special.
> >
> > In any event, the ulp can log the qpn so the context can be interrogated
> > if that is important.
> 
> I also don't see why the QP event handler can't be called
> from user context to start with.  I see absolutely no reason to
> add boilerplate code to drivers for reporting slightly more verbose
> errors on one specific piece of hardware.  I'd say clean up the mess
> that is the QP event handler first, and then once error reporting
> becomes trivial we can just do it.

I don't know, Chuck documented it in 2018:
eb93c82ed8c7 ("RDMA/core: Document QP @event_handler function")

  1164 struct ib_qp_init_attr {
  1165         /* Consumer's event_handler callback must not block */
  1166         void                  (*event_handler)(struct ib_event *, void *);

Thanks
Patrisious Haddad Sept. 8, 2022, 7:55 a.m. UTC | #9
On 07/09/2022 18:18, Christoph Hellwig wrote:
> On Wed, Sep 07, 2022 at 06:16:05PM +0300, Sagi Grimberg wrote:
>>>>
>>>> This entire code needs to move to the rdma core instead
>>>> of being leaked to ulps.
>>>
>>> We can move, but you will lose connection between queue number,
>>> caller and error itself.
>>
>> That still doesn't explain why nvme-rdma is special.
>>
>> In any event, the ulp can log the qpn so the context can be interrogated
>> if that is important.
> 
> I also don't see why the QP event handler can't be called
> from user context to start with.  I see absolutely no reason to
> add boilerplate code to drivers for reporting slightly more verbose
> errors on one specific piece of hardware.  I'd say clean up the mess
> that is the QP event handler first, and then once error reporting
> becomes trivial we can just do it.

I would like to emphasize that this is not just about a slightly more
verbose error; it is mainly about an error that wouldn't have been
reported at all without this feature. As I previously mentioned, there
are error cases in which the remote side doesn't generate a CQE, so the
remote side wouldn't even know why the QP was moved to the error state
without this feature.
Mark Zhang Nov. 1, 2022, 9:12 a.m. UTC | #10
On 9/8/2022 1:39 AM, Leon Romanovsky wrote:
> On Wed, Sep 07, 2022 at 05:18:18PM +0200, Christoph Hellwig wrote:
>> On Wed, Sep 07, 2022 at 06:16:05PM +0300, Sagi Grimberg wrote:
>>>>>
>>>>> This entire code needs to move to the rdma core instead
>>>>> of being leaked to ulps.
>>>>
>>>> We can move, but you will lose connection between queue number,
>>>> caller and error itself.
>>>
>>> That still doesn't explain why nvme-rdma is special.
>>>
>>> In any event, the ulp can log the qpn so the context can be interrogated
>>> if that is important.
>>
>> I also don't see why the QP event handler can't be called
>> from user context to start with.  I see absolutely no reason to
>> add boilerplate code to drivers for reporting slightly more verbose
>> errors on one specific piece of hardware.  I'd say clean up the mess
>> that is the QP event handler first, and then once error reporting
>> becomes trivial we can just do it.
> 
> I don't know, Chuck documented it in 2018:
> eb93c82ed8c7 ("RDMA/core: Document QP @event_handler function")
> 
>    1164 struct ib_qp_init_attr {
>    1165         /* Consumer's event_handler callback must not block */
>    1166         void                  (*event_handler)(struct ib_event *, void *);

It looks like the driver calls it in an atomic way, e.g.:
   mlx5_ib_qp_event() -> ibqp->event_handler(&event, ibqp->qp_context);

Could the driver also report it as an IB async event, since WQ_CATAS_ERROR
is defined as an async event (IB spec C11-39) and QP_FATAL is also an
event of enum ib_event_type? Is it a problem that one event is reported
twice?

If that is acceptable, then the ulp could register this event handler with
ib_register_event_handler(), which is non-atomic.
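
As a rough illustration only, assuming the driver also dispatched the QP
error through the device async event path as suggested above (all names
below are made up, not an existing ULP):

static struct ib_event_handler my_ulp_event_handler;

static void my_ulp_async_event(struct ib_event_handler *handler,
			       struct ib_event *event)
{
	/* Runs from the core's event dispatch, not from the driver's IRQ path. */
	if (event->event == IB_EVENT_QP_FATAL ||
	    event->event == IB_EVENT_QP_ACCESS_ERR)
		pr_err("async event %s on QP %u\n",
		       ib_event_msg(event->event),
		       event->element.qp->qp_num);
}

static void my_ulp_add_one(struct ib_device *device)
{
	/* Registers a per-device handler with the IB core. */
	INIT_IB_EVENT_HANDLER(&my_ulp_event_handler, device, my_ulp_async_event);
	ib_register_event_handler(&my_ulp_event_handler);
}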
Mark Zhang Nov. 2, 2022, 1:56 a.m. UTC | #11
On 11/1/2022 5:12 PM, Mark Zhang wrote:
> On 9/8/2022 1:39 AM, Leon Romanovsky wrote:
>> On Wed, Sep 07, 2022 at 05:18:18PM +0200, Christoph Hellwig wrote:
>>> On Wed, Sep 07, 2022 at 06:16:05PM +0300, Sagi Grimberg wrote:
>>>>>>
>>>>>> This entire code needs to move to the rdma core instead
>>>>>> of being leaked to ulps.
>>>>>
>>>>> We can move, but you will lose connection between queue number,
>>>>> caller and error itself.
>>>>
>>>> That still doesn't explain why nvme-rdma is special.
>>>>
>>>> In any event, the ulp can log the qpn so the context can be interrogated
>>>> if that is important.
>>>
>>> I also don't see why the QP event handler can't be called
>>> from user context to start with.  I see absolutely no reason to
>>> add boilerplate code to drivers for reporting slightly more verbose
>>> errors on one specific piece of hardware.  I'd say clean up the mess
>>> that is the QP event handler first, and then once error reporting
>>> becomes trivial we can just do it.
>>
>> I don't know, Chuck documented it in 2018:
>> eb93c82ed8c7 ("RDMA/core: Document QP @event_handler function")
>>
>>    1164 struct ib_qp_init_attr {
>>    1165         /* Consumer's event_handler callback must not block */
>>    1166         void                  (*event_handler)(struct ib_event *, void *);
> 
> Looks like driver calls it in an atomic way, e.g.:
>    mlx5_ib_qp_event() -> ibqp->event_handler(&event, ibqp->qp_context);
> 
> Could driver also report it as an IB async event, as WQ_CATAS_ERROR is
> defined as an async event (IB spec C11-39), and QP_FATAL is also an
> event of enum ib_event_type? Is it a problem that one event is reported
> twice?
> 
> If it is acceptable then ulp could register this event handler with
> ib_register_event_handler(), which is non-atomic.

Or move the qp event handler to a non-atomic context, as Christoph
suggested? That would mean fixing the mlx4/mlx5 drivers to call it from
a work queue.
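
A rough sketch of what that deferral could look like on the driver side,
with mlx5 as the example; the helper names and the work struct below are
hypothetical:

struct mlx5_ib_qp_event_work {
	struct work_struct work;
	struct ib_event event;
	struct ib_qp *ibqp;
};

static void mlx5_ib_qp_event_work_fn(struct work_struct *work)
{
	struct mlx5_ib_qp_event_work *ew =
		container_of(work, struct mlx5_ib_qp_event_work, work);

	/* Process context: the ULP's event_handler may sleep here. */
	ew->ibqp->event_handler(&ew->event, ew->ibqp->qp_context);
	kfree(ew);
}

/* Called from the atomic event path instead of invoking the handler directly. */
static void mlx5_ib_defer_qp_event(struct ib_qp *ibqp, struct ib_event *event)
{
	struct mlx5_ib_qp_event_work *ew;

	ew = kmalloc(sizeof(*ew), GFP_ATOMIC);
	if (!ew)
		return;

	ew->event = *event;
	ew->ibqp = ibqp;
	INIT_WORK(&ew->work, mlx5_ib_qp_event_work_fn);
	queue_work(system_wq, &ew->work);
}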

Patch

diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 3100643be299..7e56c0dbe8ea 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -99,6 +99,7 @@  struct nvme_rdma_queue {
 	bool			pi_support;
 	int			cq_size;
 	struct mutex		queue_lock;
+	struct work_struct	qp_err_work;
 };
 
 struct nvme_rdma_ctrl {
@@ -237,11 +238,31 @@  static struct nvme_rdma_qe *nvme_rdma_alloc_ring(struct ib_device *ibdev,
 	return NULL;
 }
 
+static void nvme_rdma_qp_error_work(struct work_struct *work)
+{
+	struct nvme_rdma_queue *queue = container_of(work,
+			struct nvme_rdma_queue, qp_err_work);
+	int ret;
+	char err[IB_ERR_SYNDROME_LENGTH];
+
+	ret = ib_get_qp_err_syndrome(queue->qp, err);
+	if (ret)
+		return;
+
+	pr_err("Queue %d got QP error syndrome %s\n",
+	       nvme_rdma_queue_idx(queue), err);
+}
+
 static void nvme_rdma_qp_event(struct ib_event *event, void *context)
 {
+	struct nvme_rdma_queue *queue = context;
+
 	pr_debug("QP event %s (%d)\n",
 		 ib_event_msg(event->event), event->event);
 
+	if (event->event == IB_EVENT_QP_FATAL ||
+	    event->event == IB_EVENT_QP_ACCESS_ERR)
+		queue_work(nvme_wq, &queue->qp_err_work);
 }
 
 static int nvme_rdma_wait_for_cm(struct nvme_rdma_queue *queue)
@@ -261,7 +282,9 @@  static int nvme_rdma_create_qp(struct nvme_rdma_queue *queue, const int factor)
 	struct ib_qp_init_attr init_attr;
 	int ret;
 
+	INIT_WORK(&queue->qp_err_work, nvme_rdma_qp_error_work);
 	memset(&init_attr, 0, sizeof(init_attr));
+	init_attr.qp_context = queue;
 	init_attr.event_handler = nvme_rdma_qp_event;
 	/* +1 for drain */
 	init_attr.cap.max_send_wr = factor * queue->queue_size + 1;
@@ -434,6 +457,7 @@  static void nvme_rdma_destroy_queue_ib(struct nvme_rdma_queue *queue)
 		ib_mr_pool_destroy(queue->qp, &queue->qp->sig_mrs);
 	ib_mr_pool_destroy(queue->qp, &queue->qp->rdma_mrs);
 
+	flush_work(&queue->qp_err_work);
 	/*
 	 * The cm_id object might have been destroyed during RDMA connection
 	 * establishment error flow to avoid getting other cma events, thus