diff mbox series

[04/39] lustre: ldlm: Do not hang if recovery restarted during lock replay

Message ID 1611249422-556-5-git-send-email-jsimmons@infradead.org (mailing list archive)
State New, archived
Headers show
Series lustre: update to latest OpenSFS version as of Jan 21 2021 | expand

Commit Message

James Simmons Jan. 21, 2021, 5:16 p.m. UTC
From: Oleg Drokin <green@whamcloud.com>

LU-13600 introduced lock ratelimiting logic, but it did not take
into account that if there's a disconnection in the REPLAY_LOCKS
phase then yet unsent locks get stuck in the sending queue so
the replay locks thread hangs with imp_replay_inflight elevated
above zero.

The direct consequence from that is recovery state machine never
advances from REPLAY to REPLAY_LOCKS status when imp_replay_inflight
is non zero.

Adjust __ldlm_replay_locks() to check if the import state changed
before attempting to send any more requests.

Add a testcase.

Fixes: 8cc7f22847 ("lustre: ptlrpc: limit rate of lock replays")
WC-bug-id: https://jira.whamcloud.com/browse/LU-14027
Lustre-commit: 7ca495ec67f474 ("LU-14027 ldlm: Do not hang if recovery restarted during lock replay")
Signed-off-by: Oleg Drokin <green@whamcloud.com>
Reviewed-on: https://review.whamcloud.com/40238
Reviewed-by: Mike Pershin <mpershin@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 fs/lustre/ldlm/ldlm_request.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)
diff mbox series

Patch

diff --git a/fs/lustre/ldlm/ldlm_request.c b/fs/lustre/ldlm/ldlm_request.c
index a2e1969..86b10a7 100644
--- a/fs/lustre/ldlm/ldlm_request.c
+++ b/fs/lustre/ldlm/ldlm_request.c
@@ -2271,9 +2271,12 @@  int __ldlm_replay_locks(struct obd_import *imp, bool rate_limit)
 		lock = list_first_entry(&list, struct ldlm_lock,
 					l_pending_chain);
 		list_del_init(&lock->l_pending_chain);
-		if (rc) {
+		/* If we disconnected in the middle - cleanup and let
+		 * reconnection to happen again. LU-14027
+		 */
+		if (rc || (imp->imp_state != LUSTRE_IMP_REPLAY_LOCKS)) {
 			LDLM_LOCK_RELEASE(lock);
-			continue; /* or try to do the rest? */
+			continue;
 		}
 		rc = replay_one_lock(imp, lock);
 		LDLM_LOCK_RELEASE(lock);