diff mbox

[PATCH/RFC] IPoIB: Free ipoib neigh on path record failure so path rec queries are retried

Message ID

1361814409-6704-1-git-send-email-roland@kernel.org (mailing list archive)

State

Accepted, archived

Headers

show

From: Roland Dreier <roland@kernel.org>
To: linux-rdma@vger.kernel.org
Cc: Roland Dreier <roland@purestorage.com>
Subject: [PATCH/RFC] IPoIB: Free ipoib neigh on path record failure so path
	rec queries are retried
Date: Mon, 25 Feb 2013 09:46:49 -0800
Message-Id: <1361814409-6704-1-git-send-email-roland@kernel.org>
Sender: linux-rdma-owner@vger.kernel.org
Precedence: bulk

Commit Message

Roland Dreier Feb. 25, 2013, 5:46 p.m. UTC

From: Roland Dreier <roland@purestorage.com>

If IPoIB fails to look up a path record (eg if it tries during an SM
failover when one SM is dead but the new one hasn't taken over yet), the
driver ends up with a neighbour structure but no address handle (AH).
There's no mechanism to recover from this: any further packets sent to
this destination will be silently dumped in ipoib_start_xmit().

Fix this by freeing the neighbour structures when a path rec query
fails, so that the next packet queued to be sent will trigger a new path
record query.

Signed-off-by: Roland Dreier <roland@purestorage.com>
---
We actually hit this while testing SM failover.  Let me know if there
are any objections to merging this for 3.9.

 drivers/infiniband/ulp/ipoib/ipoib_main.c | 3 +++
 1 file changed, 3 insertions(+)

Comments

Or Gerlitz Feb. 26, 2013, 10:32 a.m. UTC | #1

On 25/02/2013 19:46, Roland Dreier wrote:
> If IPoIB fails to look up a path record (eg if it tries during an SM
> failover when one SM is dead but the new one hasn't taken over yet), the
> driver ends up with a neighbour structure but no address handle (AH).
> There's no mechanism to recover from this: any further packets sent to
> this destination will be silently dumped in ipoib_start_xmit().

Looking on the flow of sending ARP probes, I see that in 
unicast_arp_send, if there's no AH for the path, a path query is initiated,

   if (path->ah) {
                 ipoib_dbg(priv, "Send unicast ARP to %04x\n",
be16_to_cpu(path->pathrec.dlid));

                 spin_unlock_irqrestore(&priv->lock, flags);
                 ipoib_send(dev, skb, path->ah, IPOIB_QPN(cb->hwaddr));
                 return;
         } else if ((path->query || !path_rec_start(dev, path)) &&
                    skb_queue_len(&path->queue) < 
IPOIB_MAX_PATH_REC_QUEUE) {
                 __skb_queue_tail(&path->queue, skb);
         } else {
                 ++dev->stats.tx_dropped;
                 dev_kfree_skb_any(skb);
         }


so eventually the traffic should resume once the ND state machine sends 
a probe, agree? did you only wanted to make that faster?

Or.

> Fix this by freeing the neighbour structures when a path rec query fails, so that the next packet queued to be sent will trigger a new path record query.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Roland Dreier Feb. 26, 2013, 4:55 p.m. UTC | #2

On Tue, Feb 26, 2013 at 2:32 AM, Or Gerlitz <ogerlitz@mellanox.com> wrote:
> Looking on the flow of sending ARP probes, I see that in unicast_arp_send,
> if there's no AH for the path, a path query is initiated,
>
>   if (path->ah) {
>                 ipoib_dbg(priv, "Send unicast ARP to %04x\n",
> be16_to_cpu(path->pathrec.dlid));
>
>                 spin_unlock_irqrestore(&priv->lock, flags);
>                 ipoib_send(dev, skb, path->ah, IPOIB_QPN(cb->hwaddr));
>                 return;
>         } else if ((path->query || !path_rec_start(dev, path)) &&
>                    skb_queue_len(&path->queue) < IPOIB_MAX_PATH_REC_QUEUE) {
>                 __skb_queue_tail(&path->queue, skb);
>         } else {
>                 ++dev->stats.tx_dropped;
>                 dev_kfree_skb_any(skb);
>         }
>
>
> so eventually the traffic should resume once the ND state machine sends a
> probe, agree? did you only wanted to make that faster?

Well, look at ipoib_start_xmit():

    /* unicast, arrange "switch" according to probability */
    switch (header->proto) {
    case htons(ETH_P_IP):
    case htons(ETH_P_IPV6):
        neigh = ipoib_neigh_get(dev, cb->hwaddr);
        if (unlikely(!neigh)) {
            neigh_add_path(skb, cb->hwaddr, dev);
            return NETDEV_TX_OK;
        }
        break;
    case htons(ETH_P_ARP):
    case htons(ETH_P_RARP):
        /* for unicast ARP and RARP should always perform path find */
        unicast_arp_send(skb, dev, cb);
        return NETDEV_TX_OK;

etc.  So although you mention ND, we only call unicast_arp_send() for
IPv4 ARP.  And in our case we were actually using IPv6, which follows
the main flow with neighbours even for neighbour discovery packets.
So there's no way to ever recover from having the failed path record.

In fact I bet this is why the bug has been there as long as it has
been: almost no one is using IPv6 on IPoIB seriously, and IPv4 should
work OK as you point out.

 - R.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Sebastian Riemer Feb. 27, 2013, 10:32 a.m. UTC | #3

On 26.02.2013 17:55, Roland Dreier wrote:
[...]
> In fact I bet this is why the bug has been there as long as it has
> been: almost no one is using IPv6 on IPoIB seriously, and IPv4 should
> work OK as you point out.

Thanks a lot,.... Unfortunately, we are using IPoIB with IPv6 in
production for the inter-VM and the internet gateway traffic. We had an
issue last Friday were one of our gateway machines wasn't reachable
anymore and pings didn't come through. We had to reload the ib_ipoib
modules of nearly every server connected to that gateway. I'm so glad
that we use SRP - all SRP connections survived.

I don't have the time to care for IPoIB as well at ProfitBricks so I
will encourage our networking teams to get their hands on IPoIB and
linux-rdma communication.

How to verify if we've hit this bug? How to reproduce this bug?

Cheers,
Sebastian
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

diff mbox

Patch

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index 66d6da9..8534afd 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -510,6 +510,9 @@  static void path_rec_completion(int status,
 
 	spin_unlock_irqrestore(&priv->lock, flags);
 
+	if (IS_ERR_OR_NULL(ah))
+		ipoib_del_neighs_by_gid(dev, path->pathrec.dgid.raw);
+
 	if (old_ah)
 		ipoib_put_ah(old_ah);