From patchwork Thu Jul 2 00:04:52 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: James Simmons X-Patchwork-Id: 11637583 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 60F1A92A for ; Thu, 2 Jul 2020 00:05:27 +0000 (UTC) Received: from pdx1-mailman02.dreamhost.com (pdx1-mailman02.dreamhost.com [64.90.62.194]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPS id 49372207E8 for ; Thu, 2 Jul 2020 00:05:27 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 49372207E8 Authentication-Results: mail.kernel.org; dmarc=none (p=none dis=none) header.from=infradead.org Authentication-Results: mail.kernel.org; spf=none smtp.mailfrom=lustre-devel-bounces@lists.lustre.org Received: from pdx1-mailman02.dreamhost.com (localhost [IPv6:::1]) by pdx1-mailman02.dreamhost.com (Postfix) with ESMTP id 11F4521FF61; Wed, 1 Jul 2020 17:05:16 -0700 (PDT) X-Original-To: lustre-devel@lists.lustre.org Delivered-To: lustre-devel-lustre.org@pdx1-mailman02.dreamhost.com Received: from smtp3.ccs.ornl.gov (smtp3.ccs.ornl.gov [160.91.203.39]) by pdx1-mailman02.dreamhost.com (Postfix) with ESMTP id 73D6A21FD45 for ; Wed, 1 Jul 2020 17:05:08 -0700 (PDT) Received: from star.ccs.ornl.gov (star.ccs.ornl.gov [160.91.202.134]) by smtp3.ccs.ornl.gov (Postfix) with ESMTP id 6FF1637F; Wed, 1 Jul 2020 20:05:02 -0400 (EDT) Received: by star.ccs.ornl.gov (Postfix, from userid 2004) id 6DE282BA; Wed, 1 Jul 2020 20:05:02 -0400 (EDT) From: James Simmons To: Andreas Dilger , Oleg Drokin , NeilBrown Date: Wed, 1 Jul 2020 20:04:52 -0400 Message-Id: <1593648298-10571-13-git-send-email-jsimmons@infradead.org> X-Mailer: git-send-email 1.8.3.1 In-Reply-To: <1593648298-10571-1-git-send-email-jsimmons@infradead.org> References: <1593648298-10571-1-git-send-email-jsimmons@infradead.org> Subject: [lustre-devel] [PATCH 12/18] lnet: Skip health and resends for single rail configs X-BeenThere: lustre-devel@lists.lustre.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: "For discussing Lustre software development." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: Lustre Development List MIME-Version: 1.0 Errors-To: lustre-devel-bounces@lists.lustre.org Sender: "lustre-devel" From: Chris Horn If the sender of a message only has a single interface it doesn't make sense to have LNet track the health of that interface, nor should it attempt to resend a message when it encounters a local error. There aren't any alternative interfaces to use for a resend. Similarly, we needn't track health values of a peer's NIs if the peer only has a single interface. Nor do we need to attempt to resend a message to a peer with a single interface. There's an exception for routers. We rely on NI health to determine route aliveness, so even if a router only has a single interface we still need to track its health. We can use the ln_ping_target to get the count of local NIs, and the lnet_peer struct already contains a count of the number of peer NIs. HPE-bug-id: LUS-8826 WC-bug-id: https://jira.whamcloud.com/browse/LU-13501 Lustre-commit: c5381d73b1d83 ("LU-13501 lnet: Skip health and resends for single rail configs") Signed-off-by: Chris Horn Reviewed-on: https://review.whamcloud.com/38448 Reviewed-by: Serguei Smirnov Reviewed-by: Alexander Boyko Reviewed-by: Oleg Drokin Signed-off-by: James Simmons --- net/lnet/lnet/lib-msg.c | 65 ++++++++++++++++++++++++++++++++++--------------- 1 file changed, 46 insertions(+), 19 deletions(-) diff --git a/net/lnet/lnet/lib-msg.c b/net/lnet/lnet/lib-msg.c index 7ce9c47..f759b2d 100644 --- a/net/lnet/lnet/lib-msg.c +++ b/net/lnet/lnet/lib-msg.c @@ -774,6 +774,10 @@ struct lnet_peer_ni *lpni; struct lnet_ni *ni; bool lo = false; + bool attempt_local_resend; + bool attempt_remote_resend; + bool handle_local_health; + bool handle_remote_health; /* if we're shutting down no point in handling health. */ if (the_lnet.ln_mt_state != LNET_MT_STATE_RUNNING) @@ -800,9 +804,45 @@ if (msg->msg_tx_committed) { ni = msg->msg_txni; lpni = msg->msg_txpeer; + attempt_local_resend = true; + attempt_remote_resend = true; } else { ni = msg->msg_rxni; lpni = msg->msg_rxpeer; + attempt_local_resend = false; + attempt_remote_resend = false; + } + + /* Don't further decrement the health value if a recovery message + * failed. + */ + if (msg->msg_recovery) { + handle_local_health = false; + handle_remote_health = false; + } else { + handle_local_health = false; + handle_remote_health = true; + } + + /* For local failures, health/recovery/resends are not needed if I only + * have a single (non-lolnd) interface. NB: pb_nnis includes the lolnd + * interface, so a single-rail node would have pb_nnis == 2. + */ + if (the_lnet.ln_ping_target->pb_nnis <= 2) { + handle_local_health = false; + attempt_local_resend = false; + } + + /* For remote failures, health/recovery/resends are not needed if the + * peer only has a single interface. Special case for routers where we + * rely on health feature to manage route aliveness. NB: unlike pb_nnis + * above, lp_nnis does _not_ include the lolnd, so a single-rail node + * would have lp_nnis == 1. + */ + if (lpni && lpni->lpni_peer_net->lpn_peer->lp_nnis <= 1) { + attempt_remote_resend = false; + if (!lnet_isrouter(lpni)) + handle_remote_health = false; } if (!lo) @@ -865,41 +905,28 @@ case LNET_MSG_STATUS_LOCAL_ABORTED: case LNET_MSG_STATUS_LOCAL_NO_ROUTE: case LNET_MSG_STATUS_LOCAL_TIMEOUT: - /* don't further decrement the health value if the - * recovery message failed. - */ - if (!msg->msg_recovery) + if (handle_local_health) lnet_handle_local_failure(ni); - if (msg->msg_tx_committed) - /* add to the re-send queue */ + if (attempt_local_resend) return lnet_attempt_msg_resend(msg); break; - /* These errors will not trigger a resend so simply - * finalize the message - */ case LNET_MSG_STATUS_LOCAL_ERROR: - /* don't further decrement the health value if the - * recovery message failed. - */ - if (!msg->msg_recovery) + if (handle_local_health) lnet_handle_local_failure(ni); return -1; - /* TODO: since the remote dropped the message we can - * attempt a resend safely. - */ case LNET_MSG_STATUS_REMOTE_DROPPED: - if (!msg->msg_recovery) + if (handle_remote_health) lnet_handle_remote_failure(lpni); - if (msg->msg_tx_committed) + if (attempt_remote_resend) return lnet_attempt_msg_resend(msg); break; case LNET_MSG_STATUS_REMOTE_ERROR: case LNET_MSG_STATUS_REMOTE_TIMEOUT: case LNET_MSG_STATUS_NETWORK_TIMEOUT: - if (!msg->msg_recovery) + if (handle_remote_health) lnet_handle_remote_failure(lpni); return -1; default: