From patchwork Thu Feb 27 21:15:00 2020
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: James Simmons
X-Patchwork-Id: 11410407
Return-Path:
Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123])
	by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 0907717E0
	for ; Thu, 27 Feb 2020 21:37:35 +0000 (UTC)
Received: from pdx1-mailman02.dreamhost.com (pdx1-mailman02.dreamhost.com [64.90.62.194])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by mail.kernel.org (Postfix) with ESMTPS id E5FA624690
	for ; Thu, 27 Feb 2020 21:37:34 +0000 (UTC)
DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org E5FA624690
Authentication-Results: mail.kernel.org; dmarc=none (p=none dis=none) header.from=infradead.org
Authentication-Results: mail.kernel.org; spf=none smtp.mailfrom=lustre-devel-bounces@lists.lustre.org
Received: from pdx1-mailman02.dreamhost.com (localhost [IPv6:::1])
	by pdx1-mailman02.dreamhost.com (Postfix) with ESMTP id 32874349600;
	Thu, 27 Feb 2020 13:30:54 -0800 (PST)
X-Original-To: lustre-devel@lists.lustre.org
Delivered-To: lustre-devel-lustre.org@pdx1-mailman02.dreamhost.com
Received: from smtp3.ccs.ornl.gov (smtp3.ccs.ornl.gov [160.91.203.39])
	by pdx1-mailman02.dreamhost.com (Postfix) with ESMTP id C6C6021FEB2
	for ; Thu, 27 Feb 2020 13:20:32 -0800 (PST)
Received: from star.ccs.ornl.gov (star.ccs.ornl.gov [160.91.202.134])
	by smtp3.ccs.ornl.gov (Postfix) with ESMTP id 5E5EB8F28;
	Thu, 27 Feb 2020 16:18:18 -0500 (EST)
Received: by star.ccs.ornl.gov (Postfix, from userid 2004)
	id 5C7D446A; Thu, 27 Feb 2020 16:18:18 -0500 (EST)
From: James Simmons
To: Andreas Dilger, Oleg Drokin, NeilBrown
Date: Thu, 27 Feb 2020 16:15:00 -0500
Message-Id: <1582838290-17243-433-git-send-email-jsimmons@infradead.org>
X-Mailer: git-send-email 1.8.3.1
In-Reply-To: <1582838290-17243-1-git-send-email-jsimmons@infradead.org>
References: <1582838290-17243-1-git-send-email-jsimmons@infradead.org>
Subject: [lustre-devel] [PATCH 432/622] lnet: handle unlink before send completes
X-BeenThere: lustre-devel@lists.lustre.org
X-Mailman-Version: 2.1.23
Precedence: list
List-Id: "For discussing Lustre software development."
List-Unsubscribe:
List-Archive:
List-Post:
List-Help:
List-Subscribe:
Cc: Amir Shehata, Lustre Development List
MIME-Version: 1.0
Errors-To: lustre-devel-bounces@lists.lustre.org
Sender: "lustre-devel"

From: Amir Shehata

If LNetMDUnlink() is called on an md with md->md_refcount > 0 then
the eq callback isn't called.

There is a scenario where the response times out before the send
completes. So we have a refcount on the MD. The Unlink callback gets
dropped on the floor. Send completes, but because we've already timed
out, the REPLY for the GET is dropped. Now we're left with a peer
that is in the following state:

LNET_PEER_MULTI_RAIL
LNET_PEER_DISCOVERING
LNET_PEER_PING_SENT

But no more events are coming to it, and the discovery never completes.

This scenario can get RPCs stuck as well if the response times out
before the send completes.

The solution is to set the event status to -ETIMEDOUT to inform the
send event handler that it should not expect a reply.

WC-bug-id: https://jira.whamcloud.com/browse/LU-10931
Lustre-commit: d8fc5c23fe54 ("LU-10931 lnet: handle unlink before send completes")
Signed-off-by: Amir Shehata
Reviewed-on: https://review.whamcloud.com/35444
Reviewed-by: Chris Horn
Reviewed-by: Alexandr Boyko
Reviewed-by: Olaf Weber
Reviewed-by: Oleg Drokin
Signed-off-by: James Simmons
---
 net/lnet/lnet/lib-msg.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/net/lnet/lnet/lib-msg.c b/net/lnet/lnet/lib-msg.c
index 805d5b9..0d6c363 100644
--- a/net/lnet/lnet/lib-msg.c
+++ b/net/lnet/lnet/lib-msg.c
@@ -820,7 +820,12 @@
 
 	unlink = lnet_md_unlinkable(md);
 	if (md->md_eq) {
-		msg->msg_ev.status = status;
+		if ((md->md_flags & LNET_MD_FLAG_ABORTED) && !status) {
+			msg->msg_ev.status = -ETIMEDOUT;
+			CDEBUG(D_NET, "md 0x%p already unlinked\n", md);
+		} else {
+			msg->msg_ev.status = status;
+		}
 		msg->msg_ev.unlinked = unlink;
 		lnet_eq_enqueue_event(md->md_eq, &msg->msg_ev);
 	}
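
For readers without the LNet tree at hand, the status override in the hunk
above can be modeled in isolation. This is only a sketch, not the kernel code:
struct sketch_md, SKETCH_MD_FLAG_ABORTED (including its bit value) and both
function names are hypothetical stand-ins. It also assumes that the aborted
flag marks an MD that was unlinked while a send still held a reference, which
the patch's debug message suggests but this hunk does not show.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for LNET_MD_FLAG_ABORTED; the bit value is made up. */
#define SKETCH_MD_FLAG_ABORTED (1u << 0)

/* Hypothetical stand-in for the MD; only the field the hunk reads is modeled. */
struct sketch_md {
	unsigned int md_flags;
};

/*
 * Mirrors the branch added by the patch: a send that completed
 * successfully (status == 0) on an already aborted MD is reported as
 * -ETIMEDOUT. A non-zero status is passed through unchanged.
 */
static int sketch_event_status(const struct sketch_md *md, int status)
{
	if ((md->md_flags & SKETCH_MD_FLAG_ABORTED) && !status)
		return -ETIMEDOUT;
	return status;
}

/*
 * Hypothetical handler-side check: keep waiting for a REPLY only if the
 * send event reported success.
 */
static bool sketch_should_wait_for_reply(int send_status)
{
	return send_status == 0;
}
```

With the flag set, sketch_event_status(&md, 0) yields -ETIMEDOUT, so a handler
written like sketch_should_wait_for_reply() stops waiting instead of leaving
the peer parked in the discovering state described in the commit message.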