| Message ID | 1364496315-7588-1-git-send-email-klebers@linux.vnet.ibm.com (mailing list archive) |
|---|---|
| State | Rejected |
On 28/03/2013 20:45, Kleber Sacilotto de Souza wrote:
> When the PCI adapter is going through error recovery, a call to
> mlx4_ib_post_send() will return success without the command actually
> arriving to the hardware. Adding a call to pci_channel_offline() to
> check the state of the PCI slot and returning an error will allow the
> upper layers to be aware that the command didn't succeed.

Is putting this call in the fast path common practice in other (e.g. Ethernet) drivers?

Or.
On Tuesday 02 April 2013 12:15, Or Gerlitz wrote:
> On 28/03/2013 20:45, Kleber Sacilotto de Souza wrote:
> > When the PCI adapter is going through error recovery, a call to
> > mlx4_ib_post_send() will return success without the command actually
> > arriving to the hardware. Adding a call to pci_channel_offline() to
> > check the state of the PCI slot and returning an error will allow the
> > upper layers to be aware that the command didn't succeed.
>
> Is putting this call in the fast path common practice in other (e.g.
> Ethernet) drivers?
>
> Or.

In addition, have you done any timing tests to see how this affects the data path? I am very concerned here that you penalize normal performance for the sake of a very rare corner case.

-Jack
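For context on the cost question raised here: pci_channel_offline() is a static inline helper, so the extra work on the post_send path amounts to one load of pdev->error_state plus a compare. Quoted from memory of include/linux/pci.h of that era, so treat it as a sketch rather than the exact source:

```c
/* include/linux/pci.h (approximate): the per-post_send check the patch adds */
static inline int pci_channel_offline(struct pci_dev *pdev)
{
	return (pdev->error_state != pci_channel_io_normal);
}
```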
> diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c
> index 35cced2..0fa4f72 100644
> --- a/drivers/infiniband/hw/mlx4/qp.c
> +++ b/drivers/infiniband/hw/mlx4/qp.c
> @@ -2216,6 +2216,9 @@ int mlx4_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr,
>  	__be32 blh;
>  	int i;
>
> +	if (pci_channel_offline(to_mdev(ibqp->device)->dev->pdev))
> +		return -EIO;
> +
>  	spin_lock_irqsave(&qp->sq.lock, flags);
>
>  	ind = qp->sq_next_wqe;

To pile on to what Or and Jack asked, why here? Why not in post_recv? Why not in mlx4_en? What about userspace consumers? What if the error condition triggers just after the pci_channel_offline() check? What if a command is queued but a PCI error occurs before the completion can be returned?

Is there some practical scenario where this change makes a difference?

I would assume that in case of a PCI error, the driver would notice a catastrophic error and send that asynchronous event to consumers, who would know that commands might have been lost.

 - R.
On 04/02/2013 02:00 PM, Roland Dreier wrote:
>> diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c
>> index 35cced2..0fa4f72 100644
>> --- a/drivers/infiniband/hw/mlx4/qp.c
>> +++ b/drivers/infiniband/hw/mlx4/qp.c
>> @@ -2216,6 +2216,9 @@ int mlx4_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr,
>>  	__be32 blh;
>>  	int i;
>>
>> +	if (pci_channel_offline(to_mdev(ibqp->device)->dev->pdev))
>> +		return -EIO;
>> +
>>  	spin_lock_irqsave(&qp->sq.lock, flags);
>>
>>  	ind = qp->sq_next_wqe;
>
> To pile on to what Or and Jack asked, why here? Why not in post_recv?
> Why not in mlx4_en? What about userspace consumers? What if the
> error condition triggers just after the pci_channel_offline() check?
> What if a command is queued but a PCI error occurs before the
> completion can be returned?
>
> Is there some practical scenario where this change makes a difference?
>
> I would assume that in case of a PCI error, the driver would notice a
> catastrophic error and send that asynchronous event to consumers, who
> would know that commands might have been lost.

The problem that I'm trying to solve is that some IB core modules are hanging waiting on completion queues on their remove path during error recovery. I've added the pci offline check in post_send, which seemed to have solved the problem, but while running other tests I was able to hit the bug again. Adding the check in post_recv also only hid the problem for a few testcases.

Adding any check in mlx4_en doesn't make sense in this case, because the problem is only with IB adapters. The Ethernet/RoCE adapters are recovering fine; the check has already been added in the relevant places in mlx4_core.

What async event should be sent to consumers before calling the remove functions? IB_EVENT_DEVICE_FATAL, which is currently sent by mlx4_core in case of catastrophic error (but not in PCI error recovery), doesn't seem to be handled by most of the registered event handlers. Sending IB_EVENT_PORT_ERR seems to solve the problem for most modules, but rdma_cm, which doesn't have an event handler, is still hanging. Should we implement an event handler for rdma_cm?

Thanks!
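To make the event question above concrete, dispatching a device-wide fatal event from the driver's PCI error path could look roughly like the sketch below. The function name and call site are assumptions for illustration only; ib_dispatch_event(), struct ib_event and IB_EVENT_DEVICE_FATAL are the existing ib_core interfaces, but this is not the actual mlx4_ib code.

```c
#include <rdma/ib_verbs.h>

/*
 * Illustrative sketch: notify every registered ib_core consumer that the
 * HCA is gone before the remove path starts tearing down CQs/QPs, so ULPs
 * can stop waiting for completions that will never arrive.
 */
static void example_report_fatal(struct ib_device *device)
{
	struct ib_event event;

	event.device = device;
	event.event  = IB_EVENT_DEVICE_FATAL;
	event.element.port_num = 0;	/* device-wide, not per-port */

	ib_dispatch_event(&event);
}
```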
On Thursday 04 April 2013 16:01, Kleber Sacilotto de Souza wrote:
> On 04/02/2013 02:00 PM, Roland Dreier wrote:
> >> diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c
> >> index 35cced2..0fa4f72 100644
> >> --- a/drivers/infiniband/hw/mlx4/qp.c
> >> +++ b/drivers/infiniband/hw/mlx4/qp.c
> >> @@ -2216,6 +2216,9 @@ int mlx4_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr,
> >>  	__be32 blh;
> >>  	int i;
> >>
> >> +	if (pci_channel_offline(to_mdev(ibqp->device)->dev->pdev))
> >> +		return -EIO;
> >> +
> >>  	spin_lock_irqsave(&qp->sq.lock, flags);
> >>
> >>  	ind = qp->sq_next_wqe;
> >
> > To pile on to what Or and Jack asked, why here? Why not in post_recv?
> > Why not in mlx4_en? What about userspace consumers? What if the
> > error condition triggers just after the pci_channel_offline() check?
> > What if a command is queued but a PCI error occurs before the
> > completion can be returned?
> >
> > Is there some practical scenario where this change makes a difference?
> >
> > I would assume that in case of a PCI error, the driver would notice a
> > catastrophic error and send that asynchronous event to consumers, who
> > would know that commands might have been lost.
>
> The problem that I'm trying to solve is that some IB core modules are
> hanging waiting on completion queues on their remove path during error
> recovery. I've added the pci offline check in post_send, which seemed to
> have solved the problem, but while running other tests I was able to
> hit the bug again. Adding the check in post_recv also only hid the
> problem for a few testcases.
>
> Adding any check in mlx4_en doesn't make sense in this case, because the
> problem is only with IB adapters. The Ethernet/RoCE adapters are
> recovering fine; the check has already been added in the relevant places
> in mlx4_core.
>
> What async event should be sent to consumers before calling the remove
> functions? IB_EVENT_DEVICE_FATAL, which is currently sent by mlx4_core
> in case of catastrophic error (but not in PCI error recovery), doesn't
> seem to be handled by most of the registered event handlers. Sending
> IB_EVENT_PORT_ERR seems to solve the problem for most modules, but
> rdma_cm, which doesn't have an event handler, is still hanging. Should
> we implement an event handler for rdma_cm?

This won't really help unless ALL userspace apps respond by calling ibv_close_device. You can check this by running ibv_asyncwatch (in libibverbs/examples). Until ibv_asyncwatch is exited, the low-level device restart won't work.

-Jack

> Thanks!
On Thu, Apr 4, 2013 at 5:00 PM, Jack Morgenstein <jackm@dev.mellanox.co.il> wrote:
> This won't really help unless ALL userspace apps respond by calling
> ibv_close_device. You can check this by running ibv_asyncwatch (in
> libibverbs/examples). Until ibv_asyncwatch is exited, the low-level
> device restart won't work.

Jack, so we have two different problems here: the kernel ULPs trying to reap completions from their CQs, which will never happen, and what you pointed out, under which currently the device will not be destroyed. I think it's OK to handle them one @ a time...

Kleber, as for the 1st problem, which kernel consumers are hanging forever on their CQs? IPoIB gives up after some time, e.g. see in ipoib_ib.c "assume the HW is wedged and just free up all our pending work requests".

Or.
On Thu, Apr 4, 2013 at 4:01 PM, Kleber Sacilotto de Souza wrote:
> The problem that I'm trying to solve is that some IB core modules are
> hanging waiting on completion queues on their remove path during error
> recovery.

So maybe patch them to give up after some time?

Or.
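One way to read "give up after some time" in code is to replace the unbounded wait_for_completion() on a consumer's remove path with wait_for_completion_timeout(). The sketch below is illustrative only: the function name and timeout value are made up, and a real fix would still have to decide what to do with outstanding work requests once the timeout fires.

```c
#include <linux/completion.h>
#include <linux/jiffies.h>
#include <linux/printk.h>

/* Illustrative only: a bounded wait on a ULP remove path, instead of
 * blocking forever on a completion that a frozen HCA will never signal. */
static void example_remove_wait(struct completion *comp)
{
	unsigned long left;

	left = wait_for_completion_timeout(comp, msecs_to_jiffies(10000));
	if (!left)
		pr_warn("device seems wedged, giving up on pending completions\n");
}
```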
On Thu, Apr 4, 2013 at 2:45 PM, Or Gerlitz <or.gerlitz@gmail.com> wrote:
>> The problem that I'm trying to solve is that some IB core modules are
>> hanging waiting on completion queues on their remove path during error
>> recovery.
>
> So maybe patch them to give up after some time?

I don't know so much about this PCI error recovery stuff, but it does seem sensible to trigger a catastrophic error async event when it happens (I'm assuming the recovery mechanism resets the adapter). Then we should fix at least kernel ULPs to behave appropriately when they get such an async event. And similarly, if someone wants to harden some subset of userspace apps to handle PCI error recovery too, that would be another step forward.

 - R.
On 04/04/2013 06:45 PM, Or Gerlitz wrote:
> Kleber, as for the 1st problem, which kernel consumers are hanging
> forever on their CQs? IPoIB gives up after some time, e.g. see in
> ipoib_ib.c "assume the HW is wedged and just free up all our pending
> work requests"

Or, I don't have a very comprehensive testcase to stress most parts of the IB stack during error recovery, but during my tests the kernel consumer that is still hanging is the ib_sa module; mcast_remove_one() is waiting for the port completion queue:

INFO: task eehd:4689 blocked for more than 30 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
eehd          D 0000000000000000     0  4689      2 0x00010000
Call Trace:
[c0000000fba83190] [0000000000000001] 0x1 (unreliable)
[c0000000fba83360] [c000000000016188] .__switch_to+0x140/0x268
[c0000000fba83410] [c000000000674f28] .__schedule+0x570/0x8f0
[c0000000fba836b0] [c000000000675bc4] .schedule_timeout+0x334/0x3c8
[c0000000fba837c0] [c000000000674738] .wait_for_common+0x1c0/0x238
[c0000000fba838a0] [d000000002ca230c] .mcast_remove_one+0xfc/0x168 [ib_sa]
[c0000000fba83940] [d000000002bc4f60] .ib_unregister_device+0x78/0x170 [ib_core]
...

Or rdma_cm waiting for the cma_dev completion:

Call Trace:
[c0000000f8fc70f0] [0000000000000001] 0x1 (unreliable)
[c0000000f8fc72c0] [c000000000016188] .__switch_to+0x140/0x268
[c0000000f8fc7370] [c000000000674f28] .__schedule+0x570/0x8f0
[c0000000f8fc7610] [c000000000675bc4] .schedule_timeout+0x334/0x3c8
[c0000000f8fc7720] [c000000000674738] .wait_for_common+0x1c0/0x238
[c0000000f8fc7800] [d000000002f835b0] .cma_process_remove+0x170/0x1a8 [rdma_cm]
[c0000000f8fc78b0] [d000000002f8366c] .cma_remove_one+0x84/0xb0 [rdma_cm]
[c0000000f8fc7940] [d000000002c34f60] .ib_unregister_device+0x78/0x170 [ib_core]
...

Thanks,
kleber
On 04/04/2013 06:49 PM, Roland Dreier wrote:
> I don't know so much about this PCI error recovery stuff, but it does
> seem sensible to trigger a catastrophic error async event when it
> happens (I'm assuming the recovery mechanism resets the adapter).

The PCI error recovery in the powerpc architecture, which is where I'm focusing, works by identifying a misbehaving adapter and freezing its slot, so that all MMIO writes to that device are ignored and reads return all 1's. When that happens, the Linux implementation invokes some callbacks on the driver (in this case mlx4_core) to recover from the error and reset the slot. The most common procedure is for the driver to remove the adapter and add it back, which is what mlx4_ib is trying to do.

> Then we should fix at least kernel ULPs to behave appropriately when they
> get such an async event. And similarly, if someone wants to harden
> some subset of userspace apps to handle PCI error recovery too, that
> would be another step forward.

I agree, this seems to be what is missing to have the error recovery fully functional.

Thanks,
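The callbacks mentioned above are the standard PCI error-recovery hooks a driver exposes through struct pci_error_handlers in its struct pci_driver. The sketch below only shows the shape of that interface; the function names and bodies are placeholders, not the actual mlx4_core implementation.

```c
#include <linux/pci.h>

/*
 * Placeholder PCI error-recovery hooks.  On powerpc, EEH freezes the slot,
 * calls ->error_detected(), resets the slot, then calls ->slot_reset() and
 * ->resume() so the driver can re-initialize the device.
 */
static pci_ers_result_t example_pci_err_detected(struct pci_dev *pdev,
						 enum pci_channel_state state)
{
	/* Stop touching the device; typically tear down the interface here. */
	return PCI_ERS_RESULT_NEED_RESET;
}

static pci_ers_result_t example_pci_slot_reset(struct pci_dev *pdev)
{
	/* Slot has been reset; try to bring the device back up. */
	return PCI_ERS_RESULT_RECOVERED;
}

static const struct pci_error_handlers example_err_handler = {
	.error_detected	= example_pci_err_detected,
	.slot_reset	= example_pci_slot_reset,
};

/* Wired into the driver via .err_handler = &example_err_handler in its
 * struct pci_driver. */
```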
On Mon, Apr 8, 2013 at 4:51 PM, Kleber Sacilotto de Souza <klebers@linux.vnet.ibm.com> wrote:
> Or, I don't have a very comprehensive testcase to stress most parts of
> the IB stack during error recovery, but during my tests the kernel
> consumer that is still hanging is the ib_sa module; mcast_remove_one()
> is waiting for the port completion queue:

Best if you can send the whole relevant picture, e.g. use the sysrq "t" dump when things are hung and send the stack traces of all (or all involved) kernel threads/processes.

Or.
diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c
index 35cced2..0fa4f72 100644
--- a/drivers/infiniband/hw/mlx4/qp.c
+++ b/drivers/infiniband/hw/mlx4/qp.c
@@ -2216,6 +2216,9 @@ int mlx4_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr,
 	__be32 blh;
 	int i;
 
+	if (pci_channel_offline(to_mdev(ibqp->device)->dev->pdev))
+		return -EIO;
+
 	spin_lock_irqsave(&qp->sq.lock, flags);
 
 	ind = qp->sq_next_wqe;
When the PCI adapter is going through error recovery, a call to mlx4_ib_post_send() will return success without the command actually arriving to the hardware. Adding a call to pci_channel_offline() to check the state of the PCI slot and returning an error will allow the upper layers to be aware that the command didn't succeed.

Signed-off-by: Kleber Sacilotto de Souza <klebers@linux.vnet.ibm.com>
---
 drivers/infiniband/hw/mlx4/qp.c |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)
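For completeness, this is roughly what the patch buys a kernel consumer: ib_post_send() (and hence mlx4_ib_post_send()) returns an error instead of silently accepting a work request that will never complete. The snippet below is an illustrative caller, not code from any ULP in this thread; the function name and cleanup policy are placeholders.

```c
#include <rdma/ib_verbs.h>

/* Illustrative caller only: react to -EIO from ib_post_send() when the
 * adapter is in PCI error recovery, instead of waiting on the CQ forever. */
static int example_post_one(struct ib_qp *qp, struct ib_send_wr *wr)
{
	struct ib_send_wr *bad_wr;
	int err;

	err = ib_post_send(qp, wr, &bad_wr);
	if (err) {
		/* The WR starting at bad_wr was never handed to the HCA,
		 * so no completion will arrive; release it locally. */
		return err;
	}
	return 0;
}
```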