IB/mlx4: Fail post send command on error recovery

Message ID 1364496315-7588-1-git-send-email-klebers@linux.vnet.ibm.com (mailing list archive)
State Rejected
Headers show

Commit Message

Kleber Sacilotto de Souza March 28, 2013, 6:45 p.m. UTC
When the PCI adapter is going through error recovery, a call to
mlx4_ib_post_send() will return success without the command actually
arriving to the hardware. Adding a call to pci_channel_offline() to
check the state of the PCI slot and returning an error will allow the
upper layers to be aware that the command didn't succeed.

Signed-off-by: Kleber Sacilotto de Souza <klebers@linux.vnet.ibm.com>
---
 drivers/infiniband/hw/mlx4/qp.c |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

Comments

Or Gerlitz April 2, 2013, 9:15 a.m. UTC | #1
On 28/03/2013 20:45, Kleber Sacilotto de Souza wrote:
> When the PCI adapter is going through error recovery, a call to
> mlx4_ib_post_send() will return success without the command actually
> arriving to the hardware. Adding a call to pci_channel_offline() to
> check the state of the PCI slot and returning an error will allow the
> upper layers to be aware that the command didn't succeed.

Is putting this call in the fast path common practice in other (e.g. 
Ethernet) drivers?

Or.
jackm April 2, 2013, 11:24 a.m. UTC | #2
On Tuesday 02 April 2013 12:15, Or Gerlitz wrote:
> On 28/03/2013 20:45, Kleber Sacilotto de Souza wrote:
> > When the PCI adapter is going through error recovery, a call to
> > mlx4_ib_post_send() will return success without the command actually
> > arriving to the hardware. Adding a call to pci_channel_offline() to
> > check the state of the PCI slot and returning an error will allow the
> > upper layers to be aware that the command didn't succeed.
> 
> Is putting this call in the fast path common practice in other (e.g. 
> Ethernet) drivers?
> 
> Or.
> 
In addition, have you done any timing tests to see how this affects the
data path?  I am very concerned here that you penalize normal performance
for the sake of a very rare corner case.

-Jack
Roland Dreier April 2, 2013, 5 p.m. UTC | #3
> diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c
> index 35cced2..0fa4f72 100644
> --- a/drivers/infiniband/hw/mlx4/qp.c
> +++ b/drivers/infiniband/hw/mlx4/qp.c
> @@ -2216,6 +2216,9 @@ int mlx4_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr,
>         __be32 blh;
>         int i;
>
> +       if (pci_channel_offline(to_mdev(ibqp->device)->dev->pdev))
> +               return -EIO;
> +
>         spin_lock_irqsave(&qp->sq.lock, flags);
>
>         ind = qp->sq_next_wqe;

To pile on to what Or and Jack asked, why here?  Why not in post_recv?
 Why not in mlx4_en?  What about userspace consumers?  What if the
error condition triggers just after the pci_channel_offline() check?
What if a command is queued but a PCI error occurs before the
completion can be returned?

Is there some practical scenario where this change makes a difference?

I would assume that in case of a PCI error, the driver would notice a
catastrophic error and send that asynchronous event to consumers, who
would know that commands might have been lost.
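
As a rough sketch of the consumer side (none of this is existing ULP code;
the my_ulp_* names are made up, and only IB_EVENT_DEVICE_FATAL,
INIT_IB_EVENT_HANDLER and ib_register_event_handler are the real verbs API),
a kernel ULP could react to such an event along these lines:

#include <rdma/ib_verbs.h>

/* Hypothetical ULP-side state and handler, for illustration only. */
struct my_ulp_device {
	struct ib_device	*ib_dev;
	struct ib_event_handler	event_handler;
	bool			hw_dead;
};

static void my_ulp_event_handler(struct ib_event_handler *handler,
				 struct ib_event *event)
{
	struct my_ulp_device *ulp =
		container_of(handler, struct my_ulp_device, event_handler);

	if (event->event == IB_EVENT_DEVICE_FATAL) {
		/* The HW may have eaten posted work requests; stop waiting
		 * for completions, mark the device dead and wake up anyone
		 * blocked in the remove path.
		 */
		ulp->hw_dead = true;
		/* ... flush pending work, complete() outstanding waiters ... */
	}
}

static int my_ulp_register_events(struct my_ulp_device *ulp)
{
	INIT_IB_EVENT_HANDLER(&ulp->event_handler, ulp->ib_dev,
			      my_ulp_event_handler);
	return ib_register_event_handler(&ulp->event_handler);
}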

 - R.
Kleber Sacilotto de Souza April 4, 2013, 1:01 p.m. UTC | #4
On 04/02/2013 02:00 PM, Roland Dreier wrote:
>> diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c
>> index 35cced2..0fa4f72 100644
>> --- a/drivers/infiniband/hw/mlx4/qp.c
>> +++ b/drivers/infiniband/hw/mlx4/qp.c
>> @@ -2216,6 +2216,9 @@ int mlx4_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr,
>>         __be32 blh;
>>         int i;
>>
>> +       if (pci_channel_offline(to_mdev(ibqp->device)->dev->pdev))
>> +               return -EIO;
>> +
>>         spin_lock_irqsave(&qp->sq.lock, flags);
>>
>>         ind = qp->sq_next_wqe;
> 
> To pile on to what Or and Jack asked, why here?  Why not in post_recv?
>  Why not in mlx4_en?  What about userspace consumers?  What if the
> error condition triggers just after the pci_channel_offline() check?
> What if a command is queued but a PCI error occurs before the
> completion can be returned?
> 
> Is there some practical scenario where this change makes a difference?
> 
> I would assume that in case of a PCI error, the driver would notice a
> catastrophic error and send that asynchronous event to consumers, who
> would know that commands might have been lost.
> 

The problem that I'm trying to solve is that some IB core modules are
hanging waiting on completion queues on their remove path during error
recovery. I've added the pci offline check in post_send, which seemed to
have solved the problem, but while running other tests I was able to
hit the bug again. Adding the check in post_recv also only hid the
problem for a few testcases.

Adding any check in mlx4_en doesn't make sense in this case, because the
problem is only with IB adapters. The Ethernet/RoCE adapters are
recovering fine; the check has already been added in the relevant places
in mlx4_core.

What async event should be sent to consumers before calling the remove
functions? IB_EVENT_DEVICE_FATAL, which is currently sent by mlx4_core
in case of catastrophic error (but not in PCI error recovery), doesn't
seem to be handled by most of the event handlers registered. Sending
IB_EVENT_PORT_ERR seems to solve the problem for most modules, but
rdma_cm, which doesn't have an event handler, is still hanging. Should
we implement an event handler for rdma_cm?
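
For illustration, dispatching a device-fatal event from the driver's PCI
error recovery path could look roughly like the sketch below; this helper
does not exist in mlx4_ib, and only ib_dispatch_event() and
IB_EVENT_DEVICE_FATAL are the real verbs-layer pieces:

/* Hypothetical helper, not part of mlx4_ib: tell every registered
 * consumer that the HCA is gone so they stop waiting on their CQs.
 */
static void mlx4_ib_notify_device_fatal(struct mlx4_ib_dev *ibdev)
{
	struct ib_event event;

	event.device = &ibdev->ib_dev;
	event.event  = IB_EVENT_DEVICE_FATAL;
	ib_dispatch_event(&event);	/* fans out to all ib_event_handlers */
}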


Thanks!
jackm April 4, 2013, 2 p.m. UTC | #5
On Thursday 04 April 2013 16:01, Kleber Sacilotto de Souza wrote:
> On 04/02/2013 02:00 PM, Roland Dreier wrote:
> >> diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c
> >> index 35cced2..0fa4f72 100644
> >> --- a/drivers/infiniband/hw/mlx4/qp.c
> >> +++ b/drivers/infiniband/hw/mlx4/qp.c
> >> @@ -2216,6 +2216,9 @@ int mlx4_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr,
> >>         __be32 blh;
> >>         int i;
> >>
> >> +       if (pci_channel_offline(to_mdev(ibqp->device)->dev->pdev))
> >> +               return -EIO;
> >> +
> >>         spin_lock_irqsave(&qp->sq.lock, flags);
> >>
> >>         ind = qp->sq_next_wqe;
> > 
> > To pile on to what Or and Jack asked, why here?  Why not in post_recv?
> >  Why not in mlx4_en?  What about userspace consumers?  What if the
> > error condition triggers just after the pci_channel_offline() check?
> > What if a command is queued but a PCI error occurs before the
> > completion can be returned?
> > 
> > Is there some practical scenario where this change makes a difference?
> > 
> > I would assume that in case of a PCI error, the driver would notice a
> > catastrophic error and send that asynchronous event to consumers, who
> > would know that commands might have been lost.
> > 
> 
> The problem that I'm trying to solve is that some IB core modules are
> hanging waiting on completion queues on their remove path during error
> recovery. I've added the pci offline check in post_send, which seemed to
> have solved the problem, but while running other tests I was able to
> hit the bug again. Adding the check in post_recv also only hid the
> problem for a few testcases.
> 
> Adding any check in mlx4_en doesn't make sense in this case, because the
> problem is only with IB adapters. The Ethernet/RoCE adapters are
> recovering fine; the check has already been added in the relevant places
> in mlx4_core.
> 
> What async event should be sent to consumers before calling the remove
> functions? IB_EVENT_DEVICE_FATAL, which is currently sent by mlx4_core
> in case of catastrophic error (but not in PCI error recovery), doesn't
> seem to be handled by most of the event handlers registered. Sending
> IB_EVENT_PORT_ERR seems to solve the problem for most modules, but
> rdma_cm, which doesn't have an event handler, is still hanging. Should
> we implement an event handler for rdma_cm?
>

This won't really help unless ALL userspace apps respond by calling ibv_close_device.
You can check this by running ibv_asyncwatch (in libibverbs/examples). Until ibv_asyncwatch
is exited, the low-level device restart won't work.
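
Just to make that concrete, here is a purely illustrative sketch (not taken
from ibv_asyncwatch) of a userspace async-event loop that cooperates with
device removal; a real application would also have to destroy its QPs, CQs,
MRs and PDs before the close can actually release the device:

#include <stdio.h>
#include <infiniband/verbs.h>

/* Sketch of a userspace async-event loop that gives the device back
 * on a fatal error instead of holding it open forever.
 */
static void watch_for_fatal(struct ibv_context *ctx)
{
	struct ibv_async_event event;
	int fatal;

	for (;;) {
		if (ibv_get_async_event(ctx, &event))
			break;			/* context gone or error */

		fatal = (event.event_type == IBV_EVENT_DEVICE_FATAL);
		ibv_ack_async_event(&event);

		if (fatal) {
			fprintf(stderr, "fatal device error, closing device\n");
			/* ... destroy QPs/CQs/MRs/PDs here ... */
			ibv_close_device(ctx);
			break;
		}
	}
}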

-Jack
> 
> Thanks!
> 
Or Gerlitz April 4, 2013, 9:45 p.m. UTC | #6
On Thu, Apr 4, 2013 at 5:00 PM, Jack Morgenstein <jackm@dev.mellanox.co.il>

> This won't really help unless ALL userspace apps respond by calling
> ibv_close_device. You can check this by running ibv_asyncwatch (in
> libibverbs/examples). Until ibv_asyncwatch
> is exited, the low-level device restart won't work.

Jack, so we have two different problems here: the kernel ULPs trying
to reap completions from their CQs, which will never happen, and what
you pointed out, under which the device will currently not be destroyed.
I think it's OK to handle them one at a time...

Kleber, as for the 1st problem, which kernel consumers are hanging
forever on their CQs? IPoIB gives up after some time, e.g. see in
ipoib_ib.c: "assume the HW is wedged and just free up all our pending
work requests"

Or.
Or Gerlitz April 4, 2013, 9:45 p.m. UTC | #7
On Thu, Apr 4, 2013 at 4:01 PM, Kleber Sacilotto de Souza

> The problem that I'm trying to solve is that some IB core modules are
> hanging waiting on completion queues on their remove path during error recovery.

So maybe patch them to give up after some time?
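
Something along these lines, i.e. a bounded wait in the remove path instead
of an unbounded one (the helper name and the 10 second value are only an
illustration, not existing code):

#include <linux/completion.h>
#include <linux/jiffies.h>
#include <linux/printk.h>

/* Hypothetical bounded wait for a remove path: if the HCA is dead the
 * completion will never be signalled, so give up after a while and
 * carry on with the cleanup.
 */
static void remove_path_wait(struct completion *comp, const char *what)
{
	if (!wait_for_completion_timeout(comp, msecs_to_jiffies(10 * 1000)))
		pr_warn("timed out waiting for %s, assuming HW is wedged\n",
			what);
}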

Or.
Roland Dreier April 4, 2013, 9:49 p.m. UTC | #8
On Thu, Apr 4, 2013 at 2:45 PM, Or Gerlitz <or.gerlitz@gmail.com> wrote:
>> The problem that I'm trying to solve is that some IB core modules are
>> hanging waiting on completion queues on their remove path during error recovery.

> So maybe patch them to give up after some time?

I don't know so much about this PCI error recovery stuff but it does
seem sensible to trigger a catastrophic error async event when it
happens (I'm assuming the recovery mechanism resets the adapter).

Then we should at least fix kernel ULPs to behave appropriately when they
get such an async event. And similarly, if someone wants to harden
some subset of userspace apps to handle PCI error recovery too, that
would be another step forward.

 - R.
Kleber Sacilotto de Souza April 8, 2013, 1:51 p.m. UTC | #9
On 04/04/2013 06:45 PM, Or Gerlitz wrote:
> 
> Kleber, as for the 1st problem, which kernel consumers are hanging
> forever on their CQs? IPoIB gives up after some time, e.g. see in
> ipoib_ib.c: "assume the HW is wedged and just free up all our pending
> work requests"
> 

Or, I don't have a very comprehensive testcase to stress most parts of
the IB stack during error recovery, but during my tests the kernel
consumer that is still hanging is the ib_sa module, mcast_remove_one()
is waiting for the port completion queue:

INFO: task eehd:4689 blocked for more than 30 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
eehd            D 0000000000000000     0  4689      2 0x00010000
Call Trace:
[c0000000fba83190] [0000000000000001] 0x1 (unreliable)
[c0000000fba83360] [c000000000016188] .__switch_to+0x140/0x268
[c0000000fba83410] [c000000000674f28] .__schedule+0x570/0x8f0
[c0000000fba836b0] [c000000000675bc4] .schedule_timeout+0x334/0x3c8
[c0000000fba837c0] [c000000000674738] .wait_for_common+0x1c0/0x238
[c0000000fba838a0] [d000000002ca230c] .mcast_remove_one+0xfc/0x168 [ib_sa]
[c0000000fba83940] [d000000002bc4f60] .ib_unregister_device+0x78/0x170
[ib_core]
...

Or rdma_cm waiting for the cma_dev completion:

Call Trace:
[c0000000f8fc70f0] [0000000000000001] 0x1 (unreliable)
[c0000000f8fc72c0] [c000000000016188] .__switch_to+0x140/0x268
[c0000000f8fc7370] [c000000000674f28] .__schedule+0x570/0x8f0
[c0000000f8fc7610] [c000000000675bc4] .schedule_timeout+0x334/0x3c8
[c0000000f8fc7720] [c000000000674738] .wait_for_common+0x1c0/0x238
[c0000000f8fc7800] [d000000002f835b0] .cma_process_remove+0x170/0x1a8
[rdma_cm]
[c0000000f8fc78b0] [d000000002f8366c] .cma_remove_one+0x84/0xb0 [rdma_cm]
[c0000000f8fc7940] [d000000002c34f60] .ib_unregister_device+0x78/0x170
[ib_core]
...


Thanks,
kleber
Kleber Sacilotto de Souza April 8, 2013, 2:07 p.m. UTC | #10
On 04/04/2013 06:49 PM, Roland Dreier wrote:
> 
> I don't know so much about this PCI error recovery stuff but it does
> seem sensible to trigger a catastrophic error async event when it
> happens (I'm assuming the recovery mechanism resets the adapter).

The PCI error recovery in the powerpc architecture, which is where I'm
focusing, works by identifying a misbehaving adapter and freezing its
slot, so that all MMIO writes to that device will be ignored and reads
will return all 1's. When that happens, the Linux implementation will
invoke some callbacks on the driver (in this case mlx4_core) to recover
from the error and reset the slot. The most common procedure is for the
driver to remove the adapter and add it back, which is what mlx4_ib
is trying to do.
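
For reference, these are the standard hooks involved; below is a bare-bones
sketch of their shape (the my_* names and the bodies are placeholders, not
the actual mlx4_core implementation):

#include <linux/pci.h>

/* Skeleton of the PCI error-recovery callbacks a driver wires up via
 * the .err_handler field of its struct pci_driver; bodies are
 * placeholders for illustration.
 */
static pci_ers_result_t my_error_detected(struct pci_dev *pdev,
					  enum pci_channel_state state)
{
	/* Slot is frozen: MMIO writes are dropped and reads return all
	 * 1's.  Quiesce/detach the device and request a slot reset.
	 */
	return PCI_ERS_RESULT_NEED_RESET;
}

static pci_ers_result_t my_slot_reset(struct pci_dev *pdev)
{
	/* Slot has been reset: bring the adapter back up (for mlx4 this
	 * is essentially a remove followed by an add of the device).
	 */
	return PCI_ERS_RESULT_RECOVERED;
}

static const struct pci_error_handlers my_err_handlers = {
	.error_detected	= my_error_detected,
	.slot_reset	= my_slot_reset,
};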

> 
> Then we should at least fix kernel ULPs to behave appropriately when they
> get such an async event. And similarly, if someone wants to harden
> some subset of userspace apps to handle PCI error recovery too, that
> would be another step forward.
> 

I agree, this seems to be what is missing to make the error recovery
fully functional.


Thanks,
Or Gerlitz April 8, 2013, 3:47 p.m. UTC | #11
On Mon, Apr 8, 2013 at 4:51 PM, Kleber Sacilotto de Souza
<klebers@linux.vnet.ibm.com> wrote:

> Or, I don't have a very comprehensive testcase to stress most parts of
> the IB stack during error recovery, but during my tests the kernel
> consumer that is still hanging is the ib_sa module, mcast_remove_one()
> is waiting for the port completion queue:

Best if you can send the whole relevant picture, e.g. use the sysrq "t"
dump when things are hung and send the stack traces of all (or all
involved) kernel threads/processes.

Or.

Patch

diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c
index 35cced2..0fa4f72 100644
--- a/drivers/infiniband/hw/mlx4/qp.c
+++ b/drivers/infiniband/hw/mlx4/qp.c
@@ -2216,6 +2216,9 @@  int mlx4_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr,
 	__be32 blh;
 	int i;
 
+	if (pci_channel_offline(to_mdev(ibqp->device)->dev->pdev))
+		return -EIO;
+
 	spin_lock_irqsave(&qp->sq.lock, flags);
 
 	ind = qp->sq_next_wqe;