[432/622] lnet: handle unlink before send completes

lustre: sync closely to 2.13.52

James Simmons Feb. 27, 2020, 9:15 p.m.
From: Amir Shehata <ashehata@whamcloud.com>

If LNetMDUnlink() is called on an md with md->md_refcount > 0 then
the eq callback isn't called.
There is a scenario where the response times out before the send
completes. So we have a refcount on the MD. The Unlink callback gets
dropped on the floor. Send completes, but because we've already timed
out, the REPLY for the GET is dropped. Now we're left with a peer
that is in the following state:
But no more events are coming to it, and the discovery never completes.

This scenario can get RPCs stuck as well if the response times out
before the send completes.

The solution is to set the event status to -ETIMEDOUT to inform
the send event handler that it should not expect a reply.
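
As a rough illustration of the status selection described above, here
is a minimal user-space sketch. It is not the kernel code: struct
md_model, MD_FLAG_ABORTED and event_status() are hypothetical stand-ins
introduced only for this example, and the flag value is arbitrary.
The one rule taken from the patch is that a successful completion
(status 0) on an MD that was already aborted is reported as -ETIMEDOUT.

```c
#include <errno.h>

/* Hypothetical stand-in for LNET_MD_FLAG_ABORTED; the bit value is illustrative. */
#define MD_FLAG_ABORTED (1U << 0)

/* Hypothetical model of an MD, reduced to the one field this sketch needs. */
struct md_model {
	unsigned int md_flags;
};

/*
 * Pick the status to report in the completion event.
 *
 * If the MD was aborted (unlinked while still referenced) and the send
 * itself succeeded, report -ETIMEDOUT so the handler knows no reply
 * will follow. A real error status is passed through unchanged.
 */
static int event_status(const struct md_model *md, int status)
{
	if ((md->md_flags & MD_FLAG_ABORTED) && !status)
		return -ETIMEDOUT;

	return status;
}
```

Under these assumptions, a non-aborted MD keeps whatever status the
send produced, and an aborted MD with a failed send keeps the original
error, so only the "send succeeded but nobody is waiting" case is
rewritten.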

WC-bug-id: https://jira.whamcloud.com/browse/LU-10931
Lustre-commit: d8fc5c23fe54 ("LU-10931 lnet: handle unlink before send completes")
Signed-off-by: Amir Shehata <ashehata@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/35444
Reviewed-by: Chris Horn <hornc@cray.com>
Reviewed-by: Alexandr Boyko <c17825@cray.com>
Reviewed-by: Olaf Weber <olaf.weber@hpe.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
 net/lnet/lnet/lib-msg.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/net/lnet/lnet/lib-msg.c b/net/lnet/lnet/lib-msg.c
index 805d5b9..0d6c363 100644
--- a/net/lnet/lnet/lib-msg.c
+++ b/net/lnet/lnet/lib-msg.c
@@ -820,7 +820,12 @@ 
 	unlink = lnet_md_unlinkable(md);
 	if (md->md_eq) {
-		msg->msg_ev.status = status;
+		if ((md->md_flags & LNET_MD_FLAG_ABORTED) && !status) {
+			msg->msg_ev.status = -ETIMEDOUT;
+			CDEBUG(D_NET, "md 0x%p already unlinked\n", md);
+		} else {
+			msg->msg_ev.status = status;
+		}
 		msg->msg_ev.unlinked = unlink;
 		lnet_eq_enqueue_event(md->md_eq, &msg->msg_ev);