
[9/9] NFSd: protect delegation setup with the i_lock

Message ID 1401455373-18207-10-git-send-email-jlayton@primarydata.com (mailing list archive)
State New, archived

Commit Message

Jeff Layton May 30, 2014, 1:09 p.m. UTC
From: Trond Myklebust <trond.myklebust@primarydata.com>

state_lock is a heavily contended global lock. We don't want to grab
that while simultaneously holding the inode->i_lock. Avoid doing that in
the delegation break callback by ensuring that we add/remove the
dl_perfile under the inode->i_lock.

This however, can cause an ABBA deadlock between the state_lock and the
i_lock. Work around that by doing the list manipulation in the delegation
recall from the workqueue context prior to starting the rpc call.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
---
 fs/nfsd/nfs4callback.c | 18 ++++++++++++++++--
 fs/nfsd/nfs4state.c    | 41 +++++++++++++++++++++++++++--------------
 fs/nfsd/state.h        |  1 +
 3 files changed, 44 insertions(+), 16 deletions(-)
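
For readers following the locking argument above, a toy userspace model of the ABBA ordering the commit message describes may help; the two pthread mutexes stand in for the kernel's state_lock and inode->i_lock, and the call paths are heavily simplified (this is an illustration, not kernel code):

#include <pthread.h>
#include <stdio.h>

/* stand-ins for nfsd's global state_lock and the per-inode i_lock */
static pthread_mutex_t state_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t i_lock = PTHREAD_MUTEX_INITIALIZER;

/* models delegation setup: state_lock first, then i_lock (as in
 * hash_delegation_locked() in the patch below) */
static void *delegation_setup(void *arg)
{
	pthread_mutex_lock(&state_lock);
	pthread_mutex_lock(&i_lock);		/* add to fp->fi_delegations */
	pthread_mutex_unlock(&i_lock);
	pthread_mutex_unlock(&state_lock);
	return NULL;
}

/* models the pre-patch lease break: the VFS already holds i_lock when it
 * calls ->lm_break(), and the old nfsd_break_deleg_cb() then took
 * state_lock -- the opposite order, hence the ABBA deadlock */
static void *lease_break(void *arg)
{
	pthread_mutex_lock(&i_lock);
	pthread_mutex_lock(&state_lock);
	pthread_mutex_unlock(&state_lock);
	pthread_mutex_unlock(&i_lock);
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, delegation_setup, NULL);
	pthread_create(&b, NULL, lease_break, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	puts("no deadlock this time");
	return 0;
}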

Comments

Christoph Hellwig June 2, 2014, 8:46 a.m. UTC | #1
On Fri, May 30, 2014 at 09:09:33AM -0400, Jeff Layton wrote:
> From: Trond Myklebust <trond.myklebust@primarydata.com>
> 
> state_lock is a heavily contended global lock. We don't want to grab
> that while simultaneously holding the inode->i_lock. Avoid doing that in
> the delegation break callback by ensuring that we add/remove the
> dl_perfile under the inode->i_lock.

The current code never takes the state lock (or recall lock) under
i_lock.  Non-trivial use of i_lock in nfsd is only added in this patch.

> 
> This however, can cause an ABBA deadlock between the state_lock and the
> i_lock. Work around that by doing the list manipulation in the delegation
> recall from the workqueue context prior to starting the rpc call.

I very much dislike the patch for multiple reasons.  For one thing
nfsd currently stays away from i_lock which is a VFS (and sometimes
abused by the fs) lock except for a single case in the file locking code
which really should be factored into common code.  I'd rather have
locks in nfsd than starting to use i_lock in nfsd and making auditing
it's use and it's lock hierchies more complex by introducing usage
outside of the VFS.

Second I really don't like pushing the delegation setup into the
callback workqueue.  I have a patchset refactoring the callback code
to allow adding more callbacks without lots of copy & paste, and except
for this new addition the code in the workqueue already is almost
perfectly generic for arbitrary callbacks.

Please add a lock in struct nfs4_file to protect its own members
instead.

Jeff Layton June 2, 2014, 2:17 p.m. UTC | #2
On Mon, 2 Jun 2014 01:46:16 -0700
Christoph Hellwig <hch@infradead.org> wrote:

> On Fri, May 30, 2014 at 09:09:33AM -0400, Jeff Layton wrote:
> > From: Trond Myklebust <trond.myklebust@primarydata.com>
> > 
> > state_lock is a heavily contended global lock. We don't want to grab
> > that while simultaneously holding the inode->i_lock. Avoid doing that in
> > the delegation break callback by ensuring that we add/remove the
> > dl_perfile under the inode->i_lock.
> 
> The current code never takes the state lock (or recall lock) under
> i_lock.  Non-trivial use of i_lock in nfsd is only added in this patch.
> 
> > 
> > This however, can cause an ABBA deadlock between the state_lock and the
> > i_lock. Work around that by doing the list manipulation in the delegation
> > recall from the workqueue context prior to starting the rpc call.
> 
> I very much dislike the patch for multiple reasons.  For one thing
> nfsd currently stays away from i_lock which is a VFS (and sometimes
> abused by the fs) lock except for a single case in the file locking code
> which really should be factored into common code.  I'd rather have
> locks in nfsd than start using i_lock in nfsd and make auditing
> its use and its lock hierarchies more complex by introducing usage
> outside of the VFS.
> 

Ok, fair enough. A later patch in the series adds a per nfs4_file lock
(the fi_lock) that we can use for this purpose in lieu of the i_lock.
Nesting that within the i_lock shouldn't be too painful.
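
A minimal sketch of the nesting being discussed, assuming that later patch ends up putting a spinlock_t fi_lock in struct nfs4_file (the helper name below is made up for illustration):

/* Sketch only: the VFS calls the lease-break callback with inode->i_lock
 * already held, so a per-file fi_lock has to nest inside i_lock. */
static void break_file_delegations(struct nfs4_file *fp)
{
	struct nfs4_delegation *dp;

	/* the caller, in the ->lm_break path, already holds
	 * fp->fi_inode->i_lock */
	spin_lock(&fp->fi_lock);
	fp->fi_had_conflict = true;
	list_for_each_entry(dp, &fp->fi_delegations, dl_perfile)
		nfsd_break_one_deleg(dp);
	spin_unlock(&fp->fi_lock);
}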

> Second I really don't like pushing the delegation setup into the
> callback workqueue.  I have a patchset refactoring the callback code
> to allow adding more callbacks without lots of copy & paste, and except
> for this new addition the code in the workqueue already is almost
> perfectly generic for arbitrary callbacks.
> 
> Please add a lock in struct nfs4_file to protect its own members
> instead.
> 

This is a problem, however. The del_recall_lru list is a per-namespace
list, so we need a lock that is either global or per-namespace to add
things onto it. The point of this patch is to get that extra locking
out of the codepath where the i_lock is already held.

We could do that within the rpc_call_prepare operation, but the main
problem there is that the delegation could leak if the rpc task
allocation fails. Would you be amenable to adding a "cb_prepare"
operation or something that we could use to run things from the
workqueue before the actual callback runs?
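
To make that concrete, a hypothetical sketch of such a hook follows; the cb_prepare member, the _sketch names, and their placement are assumptions for illustration, not something taken from Christoph's refactoring series:

/* Hypothetical: a per-callback "prepare" hook run from the callback
 * workqueue before the RPC is dispatched, so per-callback setup (like
 * moving a delegation onto del_recall_lru) stays out of the generic
 * nfsd4_do_callback_rpc() path. */
struct nfsd4_callback_sketch {
	struct work_struct	cb_work;
	void			(*cb_prepare)(struct nfsd4_callback_sketch *cb);
	/* ... the rest of struct nfsd4_callback ... */
};

static void nfsd4_do_callback_rpc_sketch(struct work_struct *w)
{
	struct nfsd4_callback_sketch *cb =
		container_of(w, struct nfsd4_callback_sketch, cb_work);

	if (cb->cb_prepare)
		cb->cb_prepare(cb);	/* e.g. queue onto del_recall_lru */
	/* ... then kick off the rpc as nfsd4_run_callback_rpc() does ... */
}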
Christoph Hellwig June 2, 2014, 6:20 p.m. UTC | #3
On Mon, Jun 02, 2014 at 10:17:33AM -0400, Jeff Layton wrote:
> Ok, fair enough. A later patch in the series adds a per nfs4_file lock
> (the fi_lock) that we can use for this purpose in lieu of the i_lock.
> Nesting that within the i_lock shouldn't be too painful.

i_lock is part of some very complex locking hierarchies in the inode
and dentry caches.  Long term I'd prefer to get the file locking code
detangled from that, but let's keep it simple and nest the nfs4_file
lock inside it for now.

> We could do that within the rpc_call_prepare operation, but the main
> problem there is that the delegation could leak if the rpc task
> allocation fails. Would you be amenable to adding a "cb_prepare"
> operation or something that we could use to run things from the
> workqueue before the actual callback runs?

I guess we'll have to do it then.  I'll see if it can be done nicely.

I suspect the root problem of all this is that the file locking code
calls the lock manager ops with i_lock held, which isn't very nice.

Jeff Layton June 2, 2014, 6:48 p.m. UTC | #4
On Mon, 2 Jun 2014 11:20:26 -0700
Christoph Hellwig <hch@infradead.org> wrote:

> On Mon, Jun 02, 2014 at 10:17:33AM -0400, Jeff Layton wrote:
> > Ok, fair enough. A later patch in the series adds a per nfs4_file lock
> > (the fi_lock) that we can use for this purpose in lieu of the i_lock.
> > Nesting that within the i_lock shouldn't be too painful.
> 
> i_lock is part of some very complex locking hierarchies in the inode
> and dentry caches.  Long term I'd prefer to get the file locking code
> detangled from that, but let's keep it simple and nest the nfs4_file
> lock inside it for now.
> 
> > We could do that within the rpc_call_prepare operation, but the main
> > problem there is that the delegation could leak if the rpc task
> > allocation fails. Would you be amenable to adding a "cb_prepare"
> > operation or something that we could use to run things from the
> > workqueue before the actual callback runs?
> 
> I guess we'll have to do it then.  I'll see if it can be done nicely.
> 

There is another option, but it's sort of ugly...

We could try to queue the thing in the rpc_call_prepare op, and if that
never runs then fall back to doing it in the rpc_release callback. The
tricky part is figuring out whether we'll need to queue it in
rpc_release.

Perhaps we can try to use the dl_time field to indicate that as well so
we won't need to add a separate flag for it.
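
A hypothetical sketch of that idea follows; the _sketch names, the exact calldata handling, and treating dl_time == 0 as "not yet queued" are assumptions made for illustration (the real callback ops do more work than this):

static void nfsd4_cb_recall_queue(struct nfsd4_callback *cb)
{
	struct nfs4_delegation *dp = container_of(cb, struct nfs4_delegation,
						  dl_recall);

	/* dl_time == 0 means "not yet put on del_recall_lru" */
	if (!dp->dl_time)
		nfsd4_prepare_cb_recall(dp);	/* sets dl_time and queues it */
}

static void nfsd4_cb_recall_prepare_sketch(struct rpc_task *task, void *calldata)
{
	nfsd4_cb_recall_queue(calldata);
	rpc_call_start(task);
}

static void nfsd4_cb_recall_release_sketch(void *calldata)
{
	/* fallback: if the rpc task never reached ->rpc_call_prepare
	 * (e.g. task allocation failed), queue the delegation here so it
	 * isn't leaked, then drop references as the real release op does */
	nfsd4_cb_recall_queue(calldata);
}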

> I suspect the root problem of all this is that the file locking code
> calls the lock manager ops with i_lock held, which isn't very nice.
> 

Yeah, it would be nice if it didn't, but it's better than it was. It
used to use a global spinlock for that instead of the i_lock, but that
was also a clear scalability problem...

Patch

diff --git a/fs/nfsd/nfs4callback.c b/fs/nfsd/nfs4callback.c
index 2c73cae9899d..3d67bbf26cee 100644
--- a/fs/nfsd/nfs4callback.c
+++ b/fs/nfsd/nfs4callback.c
@@ -1011,9 +1011,8 @@  static void nfsd4_process_cb_update(struct nfsd4_callback *cb)
 		run_nfsd4_cb(cb);
 }
 
-static void nfsd4_do_callback_rpc(struct work_struct *w)
+static void nfsd4_run_callback_rpc(struct nfsd4_callback *cb)
 {
-	struct nfsd4_callback *cb = container_of(w, struct nfsd4_callback, cb_work);
 	struct nfs4_client *clp = cb->cb_clp;
 	struct rpc_clnt *clnt;
 
@@ -1031,11 +1030,25 @@  static void nfsd4_do_callback_rpc(struct work_struct *w)
 			cb->cb_ops, cb);
 }
 
+static void nfsd4_do_callback_rpc(struct work_struct *w)
+{
+	struct nfsd4_callback *cb = container_of(w, struct nfsd4_callback, cb_work);
+	nfsd4_run_callback_rpc(cb);
+}
+
 void nfsd4_init_callback(struct nfsd4_callback *cb)
 {
 	INIT_WORK(&cb->cb_work, nfsd4_do_callback_rpc);
 }
 
+static void nfsd4_do_cb_recall(struct work_struct *w)
+{
+	struct nfsd4_callback *cb = container_of(w, struct nfsd4_callback, cb_work);
+
+	nfsd4_prepare_cb_recall(cb->cb_op);
+	nfsd4_run_callback_rpc(cb);
+}
+
 void nfsd4_cb_recall(struct nfs4_delegation *dp)
 {
 	struct nfsd4_callback *cb = &dp->dl_recall;
@@ -1052,6 +1065,7 @@  void nfsd4_cb_recall(struct nfs4_delegation *dp)
 
 	INIT_LIST_HEAD(&cb->cb_per_client);
 	cb->cb_done = true;
+	INIT_WORK(&cb->cb_work, nfsd4_do_cb_recall);
 
 	run_nfsd4_cb(&dp->dl_recall);
 }
diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
index 553c2d6d48dc..c249c5261640 100644
--- a/fs/nfsd/nfs4state.c
+++ b/fs/nfsd/nfs4state.c
@@ -438,7 +438,9 @@  hash_delegation_locked(struct nfs4_delegation *dp, struct nfs4_file *fp)
 	lockdep_assert_held(&state_lock);
 
 	dp->dl_stid.sc_type = NFS4_DELEG_STID;
+	spin_lock(&fp->fi_inode->i_lock);
 	list_add(&dp->dl_perfile, &fp->fi_delegations);
+	spin_unlock(&fp->fi_inode->i_lock);
 	list_add(&dp->dl_perclnt, &dp->dl_stid.sc_client->cl_delegations);
 }
 
@@ -446,14 +448,20 @@  hash_delegation_locked(struct nfs4_delegation *dp, struct nfs4_file *fp)
 static void
 unhash_delegation(struct nfs4_delegation *dp)
 {
+	struct nfs4_file *fp = dp->dl_file;
+
 	spin_lock(&state_lock);
 	list_del_init(&dp->dl_perclnt);
-	list_del_init(&dp->dl_perfile);
 	list_del_init(&dp->dl_recall_lru);
+	if (!list_empty(&dp->dl_perfile)) {
+		spin_lock(&fp->fi_inode->i_lock);
+		list_del_init(&dp->dl_perfile);
+		spin_unlock(&fp->fi_inode->i_lock);
+	}
 	spin_unlock(&state_lock);
-	if (dp->dl_file) {
-		nfs4_put_deleg_lease(dp->dl_file);
-		put_nfs4_file(dp->dl_file);
+	if (fp) {
+		nfs4_put_deleg_lease(fp);
+		put_nfs4_file(fp);
 		dp->dl_file = NULL;
 	}
 }
@@ -2766,24 +2774,31 @@  out:
 	return ret;
 }
 
-static void nfsd_break_one_deleg(struct nfs4_delegation *dp)
+void nfsd4_prepare_cb_recall(struct nfs4_delegation *dp)
 {
 	struct nfs4_client *clp = dp->dl_stid.sc_client;
 	struct nfsd_net *nn = net_generic(clp->net, nfsd_net_id);
 
-	lockdep_assert_held(&state_lock);
+	/*
+	 * We can't do this in nfsd_break_deleg_cb because it is
+	 * already holding inode->i_lock
+	 */
+	spin_lock(&state_lock);
+	if (list_empty(&dp->dl_recall_lru)) {
+		dp->dl_time = get_seconds();
+		list_add_tail(&dp->dl_recall_lru, &nn->del_recall_lru);
+	}
+	spin_unlock(&state_lock);
+}
+
+static void nfsd_break_one_deleg(struct nfs4_delegation *dp)
+{
 	/* We're assuming the state code never drops its reference
 	 * without first removing the lease.  Since we're in this lease
 	 * callback (and since the lease code is serialized by the kernel
 	 * lock) we know the server hasn't removed the lease yet, we know
 	 * it's safe to take a reference: */
 	atomic_inc(&dp->dl_count);
-
-	list_add_tail(&dp->dl_recall_lru, &nn->del_recall_lru);
-
-	/* Only place dl_time is set; protected by i_lock: */
-	dp->dl_time = get_seconds();
-
 	nfsd4_cb_recall(dp);
 }
 
@@ -2808,11 +2823,9 @@  static void nfsd_break_deleg_cb(struct file_lock *fl)
 	 */
 	fl->fl_break_time = 0;
 
-	spin_lock(&state_lock);
 	fp->fi_had_conflict = true;
 	list_for_each_entry(dp, &fp->fi_delegations, dl_perfile)
 		nfsd_break_one_deleg(dp);
-	spin_unlock(&state_lock);
 }
 
 static
diff --git a/fs/nfsd/state.h b/fs/nfsd/state.h
index 374c66283ac5..aacccb2a476c 100644
--- a/fs/nfsd/state.h
+++ b/fs/nfsd/state.h
@@ -472,6 +472,7 @@  extern void nfsd4_cb_recall(struct nfs4_delegation *dp);
 extern int nfsd4_create_callback_queue(void);
 extern void nfsd4_destroy_callback_queue(void);
 extern void nfsd4_shutdown_callback(struct nfs4_client *);
+extern void nfsd4_prepare_cb_recall(struct nfs4_delegation *dp);
 extern void nfs4_put_delegation(struct nfs4_delegation *dp);
 extern struct nfs4_client_reclaim *nfs4_client_to_reclaim(const char *name,
 							struct nfsd_net *nn);