From patchwork Thu Jun 27 18:30:27 2019
X-Patchwork-Submitter: Huang Le
X-Patchwork-Id: 11020267
From: Huang Le
To: "bfields@fieldses.org", "jlayton@kernel.org", "linux-nfs@vger.kernel.org",
	"linux-kernel@vger.kernel.org"
Subject: [PATCH] nfsd4: fix a deadlock on state owner replay mutex
Date: Thu, 27 Jun 2019 18:30:27 +0000
Message-ID: <720b91b1204b4c73be1b6ec2ff44dbab@jd.com>
X-Mailing-List: linux-nfs@vger.kernel.org

From: Huang Le

In move_to_close_lru(), which is only called on the nfsd4 CLOSE path, the
code can wait for the stid refcount to drop to 2 while holding that state
owner's replay mutex.  However, the other stid reference holder (normally a
parallel CLOSE op) that move_to_close_lru() is waiting for may itself be
trying to acquire the same replay mutex, so neither side can make progress.

This patch fixes the deadlock by clearing the replay owner before waiting
and assigning it back afterwards.

Signed-off-by: Huang Le
---
I guess we should cc this patch to the stable trees, since a malicious
client could craft parallel CLOSE ops to put all nfsd tasks into D state
in short order.
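To illustrate the ordering the patch enforces, here is a rough user-space
sketch (hypothetical names; pthreads and a condition variable stand in for
so_replay.rp_mutex and the close_wq wait queue). It is not the kernel code,
only the shape of the fix:

/*
 * Sketch only: rp_mutex stands in for so_replay.rp_mutex, the condition
 * variable for the close_wq wait queue.  All names are hypothetical.
 */
#include <pthread.h>
#include <stdatomic.h>

struct demo_stid {
	atomic_int      sc_count;	/* like st_stid.sc_count */
	pthread_mutex_t rp_mutex;	/* like so_replay.rp_mutex */
	pthread_mutex_t wq_lock;	/* protects the wait below */
	pthread_cond_t  close_wq;	/* like the close_wq wait queue */
};

/*
 * Old ordering: wait for sc_count == 2 while rp_mutex is held.  A parallel
 * CLOSE holding a reference blocks on rp_mutex and never drops its
 * reference, so both tasks wait forever.
 *
 * New ordering, as in the patch: drop rp_mutex first, wait, re-take it.
 */
static void demo_move_to_close_lru(struct demo_stid *s)
{
	if (atomic_load(&s->sc_count) != 2) {
		pthread_mutex_unlock(&s->rp_mutex);	/* nfsd4_cstate_clear_replay() */

		pthread_mutex_lock(&s->wq_lock);
		while (atomic_load(&s->sc_count) != 2)	/* wait_event(close_wq, ...) */
			pthread_cond_wait(&s->close_wq, &s->wq_lock);
		pthread_mutex_unlock(&s->wq_lock);

		pthread_mutex_lock(&s->rp_mutex);	/* nfsd4_cstate_assign_replay() */
	}
	/* ... proceed with release_all_access(s), etc. ... */
}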
diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
index 618e660..5f6a48f 100644
--- a/fs/nfsd/nfs4state.c
+++ b/fs/nfsd/nfs4state.c
@@ -3829,12 +3829,12 @@ static void nfs4_free_openowner(struct nfs4_stateowner *so)
  * them before returning however.
  */
 static void
-move_to_close_lru(struct nfs4_ol_stateid *s, struct net *net)
+move_to_close_lru(struct nfsd4_compound_state *cstate, struct nfs4_ol_stateid *s,
+		  struct net *net)
 {
 	struct nfs4_ol_stateid *last;
 	struct nfs4_openowner *oo = openowner(s->st_stateowner);
-	struct nfsd_net *nn = net_generic(s->st_stid.sc_client->net,
-					  nfsd_net_id);
+	struct nfsd_net *nn = net_generic(net, nfsd_net_id);
 
 	dprintk("NFSD: move_to_close_lru nfs4_openowner %p\n", oo);
 
@@ -3846,8 +3846,19 @@ static void nfs4_free_openowner(struct nfs4_stateowner *so)
 	 * Wait for the refcount to drop to 2. Since it has been unhashed,
 	 * there should be no danger of the refcount going back up again at
 	 * this point.
+	 *
+	 * Before waiting, we clear cstate->replay_owner to release its
+	 * so_replay.rp_mutex, since other reference holders might be acquiring
+	 * the same mutex before they can drop their references. The replay_owner
+	 * can safely be assigned back after they are done.
 	 */
-	wait_event(close_wq, refcount_read(&s->st_stid.sc_count) == 2);
+	if (refcount_read(&s->st_stid.sc_count) != 2) {
+		struct nfs4_stateowner *so = cstate->replay_owner;
+
+		nfsd4_cstate_clear_replay(cstate);
+		wait_event(close_wq, refcount_read(&s->st_stid.sc_count) == 2);
+		nfsd4_cstate_assign_replay(cstate, so);
+	}
 
 	release_all_access(s);
 	if (s->st_stid.sc_file) {
@@ -5531,7 +5542,8 @@ static inline void nfs4_stateid_downgrade(struct nfs4_ol_stateid *stp, u32 to_ac
 	return status;
 }
 
-static void nfsd4_close_open_stateid(struct nfs4_ol_stateid *s)
+static void nfsd4_close_open_stateid(struct nfsd4_compound_state *cstate,
+				     struct nfs4_ol_stateid *s)
 {
 	struct nfs4_client *clp = s->st_stid.sc_client;
 	bool unhashed;
@@ -5549,7 +5561,7 @@ static void nfsd4_close_open_stateid(struct nfs4_ol_stateid *s)
 		spin_unlock(&clp->cl_lock);
 		free_ol_stateid_reaplist(&reaplist);
 		if (unhashed)
-			move_to_close_lru(s, clp->net);
+			move_to_close_lru(cstate, s, clp->net);
 	}
 }
 
@@ -5587,7 +5599,7 @@ static void nfsd4_close_open_stateid(struct nfs4_ol_stateid *s)
 	 */
 	nfs4_inc_and_copy_stateid(&close->cl_stateid, &stp->st_stid);
 
-	nfsd4_close_open_stateid(stp);
+	nfsd4_close_open_stateid(cstate, stp);
 	mutex_unlock(&stp->st_mutex);
 
 	/* v4.1+ suggests that we send a special stateid in here, since the