
NFS: state manager thread must stay running.

Message ID 20140813140831.22f3e9c7@notabene.brown (mailing list archive)

Commit Message

NeilBrown Aug. 13, 2014, 4:08 a.m. UTC
If the server restarts at an awkward time it is possible for write
requests to block waiting for the state manager to run.
If the state manager isn't already running a new thread will
need to be started which requires a GFP_KERNEL allocation
(for do_fork).

If memory is short, that GFP_KERNEL allocation could block on the
writes going out via NFS, resulting in a deadlock.

The easiest solution is to keep the manager thread running
always.

There are two interesting requirements for the manager thread:
1/ It must allow SIGKILL, which can abort NFS transactions to
   a dead server.
2/ It may continue running after the filesystem is unmounted,
   until the server recovers or the thread is SIGKILLed

These, particularly the last, make it unsuitable for kthread_worker,
which can only be killed synchronously (via kthread_stop()).

In this patch the manager thread no longer holds a counted reference
on the client. Instead a new flag NFS_CS_MANAGER indicates that the
thread is running and so holds an implicit reference.
nfs_put_client will only free the client if this flag is clear.
If it is set, the manager must free the client.

nn->nfs_client_lock is used to ensure nfs_put_client() and the
thread don't trip over each other at shutdown.

Signed-off-by: NeilBrown <neilb@suse.de>

Comments

Tejun Heo Aug. 17, 2014, 1:11 p.m. UTC | #1
Hello, Neil.

On Wed, Aug 13, 2014 at 02:08:31PM +1000, NeilBrown wrote:
> There are two interesting requirements for the manager thread:
> 1/ It must allow SIGKILL, which can abort NFS transactions to
>    a dead server.
> 2/ It may continue running after the filesystem is unmounted,
>    until the server recovers or the thread is SIGKILLed

Out of curiosity, why is SIGKILL handling necessary at all?  Can't nfs
just keep the manager running while any mount is active?

Thanks.
NeilBrown Aug. 17, 2014, 9:14 p.m. UTC | #2
On Sun, 17 Aug 2014 09:11:56 -0400 Tejun Heo <tj@kernel.org> wrote:

> Hello, Neil.
> 
> On Wed, Aug 13, 2014 at 02:08:31PM +1000, NeilBrown wrote:
> > There are two interesting requirements for the manager thread:
> > 1/ It must allow SIGKILL, which can abort NFS transactions to
> >    a dead server.
> > 2/ It may continue running after the filesystem is unmounted,
> >    until the server recovers or the thread is SIGKILLed
> 
> Out of curiosity, why is SIGKILL handling necessary at all?  Can't nfs
> just keep the manager running while any mount is active?

It does.  The manager will even continue after there are no mounts active if
it is blocked on a non-responding server.

If there is an outstanding RPC request to a non-responsive server
then the only way to abort that request is to send SIGKILL to the thread
which is waiting for the request.
So if we want things to clean up properly on shutdown it seems best for a
SIGKILL to be able to abort that thread.

It is quite likely that a deep rewrite of various details could simplify
this.  There seems little point in the manager continuing after the lease
timeout has expired, for example.  So there could be better ways to clean up.
However I think we probably do want the state manager to continue trying at
least until the lease time expires, so we cannot clean up the thread at
unmount time - it needs to persist at least a little while.

It is also possible that I'm missing some important details.  I really just
wanted to avoid the possible memory deadlock without breaking anything that
I didn't completely understand.  The proposed patch is the best I could do.

Thanks,
NeilBrown
Tejun Heo Aug. 18, 2014, 2:29 p.m. UTC | #3
Hello, Neil.

On Mon, Aug 18, 2014 at 07:14:57AM +1000, NeilBrown wrote:
> It does.  The manager will even continue after there are no mounts active if
> it is blocked on a non-responding server.
> 
> If there is an outstanding RPC request to a non-responsive server
> then the only way to abort that request is to send SIGKILL to the thread
> which is waiting for the request.
> So if we want things to clean up properly on shutdown it seems best for a
> SIGKILL to be able to abort that thread.

I see.  I'm not convinced SIGKILL is the right way to communicate
things like this tho.  If this is actually a thing which needs to be
exposed, we'd be a lot better off with an interface which is more to
the point rather than meddling with signals to the manager kernel
thread, which ultimately is an implementation detail.  Given the
requirement, I don't have objections to the patch but the requirement
seems rather lopsided.

Thanks.
Trond Myklebust Sept. 17, 2014, 1:43 a.m. UTC | #4
On Wed, Aug 13, 2014 at 12:08 AM, NeilBrown <neilb@suse.de> wrote:
>
>
> If the server restarts at an awkward time it is possible for write
> requests to block waiting for the state manager to run.
> If the state manager isn't already running a new thread will
> need to be started which requires a GFP_KERNEL allocation
> (for do_fork).
>
> If memory is short, that GFP_KERNEL allocation could block on the
> writes going out via NFS, resulting in a deadlock.
>
> The easiest solution is to keep the manager thread running
> always.

I'm still trying to figure out what to do about this patch. There are
2 concerns:

1) If we're so low on memory that we can't even start a state manager
thread, then how do we guarantee that the recovery can be completed?
We rely on that state manager thread being able to allocate memory to
perform the lease, session, open and lock recoveries.

2) Does allocating 1 thread per server really scale? Consider that
each thread is going to be completely idle almost all the time.
Wouldn't it be better to just preallocate only 1 thread, and then
allocating others dynamically if we really do hit a situation where 2
or more servers are rebooting at the same time?

Cheers
  Trond
NeilBrown Sept. 17, 2014, 3:05 a.m. UTC | #5
On Tue, 16 Sep 2014 21:43:07 -0400 Trond Myklebust
<trond.myklebust@primarydata.com> wrote:

> On Wed, Aug 13, 2014 at 12:08 AM, NeilBrown <neilb@suse.de> wrote:
> >
> >
> > If the server restarts at an awkward time it is possible for write
> > requests to block waiting for the state manager to run.
> > If the state manager isn't already running a new thread will
> > need to be started which requires a GFP_KERNEL allocation
> > (for do_fork).
> >
> > If memory is short, that GFP_KERNEL allocation could block on the
> > writes going out via NFS, resulting in a deadlock.
> >
> > The easiest solution is to keep the manager thread running
> > always.
> 
> I'm still trying to figure out what to do about this patch. There are
> 2 concerns:
> 
> 1) If we're so low on memory that we can't even start a state manager
> thread, then how do we guarantee that the recovery can be completed?
> We rely on that state manager thread being able to allocate memory to
> perform the lease, session, open and lock recoveries.

All the allocations performed by the state manager are (I assume) GFP_NOFS.
Creating a new thread requires GFP_KERNEL allocations, particularly in
dup_task_struct, which is called by kthreadd and so is well out of reach
for NFS to change the GFP flags.

Having said that, it occurs to me that my other dead-lock avoidance patch
might fix this problem as well.

The one case where I have seen a problem with starting the state manager, the
machine in question had several memory-shortage issues.  I think we finally
decided that a problem with too_many_isolated handling was the main cause.
So I cannot get very much reliable information from the stack trace there.  I
presume that in the current kernel, thread creation could deadlock against
nfs_release_page() (it was actually stuck in a congestion_wait()).

So I can't be certain, but I think this proactive thread creation won't be
needed once nfs_release_page() doesn't block indefinitely.

So you can drop this patch.

Thanks.

Though on the topic of patches that you don't know what to do with ....
Could you have a look at
   http://permalink.gmane.org/gmane.linux.nfs/56154
it appears that it slipped under your radar, and it fell off mine until just
recently.

NeilBrown

Patch

diff --git a/fs/nfs/client.c b/fs/nfs/client.c
index 180d1ec9c32e..9973675737dd 100644
--- a/fs/nfs/client.c
+++ b/fs/nfs/client.c
@@ -272,13 +272,16 @@  void nfs_put_client(struct nfs_client *clp)
 	nn = net_generic(clp->cl_net, nfs_net_id);
 
 	if (atomic_dec_and_lock(&clp->cl_count, &nn->nfs_client_lock)) {
+		int manager_will_free =
+			test_bit(NFS_CS_MANAGER, &clp->cl_res_state);
 		list_del(&clp->cl_share_link);
 		nfs_cb_idr_remove_locked(clp);
-		spin_unlock(&nn->nfs_client_lock);
-
+		if (manager_will_free)
+			nfs4_schedule_state_manager(clp);
 		WARN_ON_ONCE(!list_empty(&clp->cl_superblocks));
-
-		clp->rpc_ops->free_client(clp);
+		spin_unlock(&nn->nfs_client_lock);
+		if (!manager_will_free)
+			clp->rpc_ops->free_client(clp);
 	}
 }
 EXPORT_SYMBOL_GPL(nfs_put_client);
diff --git a/fs/nfs/nfs4_fs.h b/fs/nfs/nfs4_fs.h
index ba2affa51941..014e730ee7a2 100644
--- a/fs/nfs/nfs4_fs.h
+++ b/fs/nfs/nfs4_fs.h
@@ -432,6 +432,7 @@  extern void nfs4_schedule_lease_recovery(struct nfs_client *);
 extern int nfs4_wait_clnt_recover(struct nfs_client *clp);
 extern int nfs4_client_recover_expired_lease(struct nfs_client *clp);
 extern void nfs4_schedule_state_manager(struct nfs_client *);
+extern int nfs4_start_state_manager(struct nfs_client *);
 extern void nfs4_schedule_path_down_recovery(struct nfs_client *clp);
 extern int nfs4_schedule_stateid_recovery(const struct nfs_server *, struct nfs4_state *);
 extern int nfs4_schedule_migration_recovery(const struct nfs_server *);
diff --git a/fs/nfs/nfs4client.c b/fs/nfs/nfs4client.c
index aa9ef4876046..907111174886 100644
--- a/fs/nfs/nfs4client.c
+++ b/fs/nfs/nfs4client.c
@@ -401,6 +401,11 @@  struct nfs_client *nfs4_init_client(struct nfs_client *clp,
 	}
 	__set_bit(NFS_CS_IDMAP, &clp->cl_res_state);
 
+	error = nfs4_start_state_manager(clp);
+	if (error < 0)
+		goto error;
+	__set_bit(NFS_CS_MANAGER, &clp->cl_res_state);
+
 	error = nfs4_init_client_minor_version(clp);
 	if (error < 0)
 		goto error;
diff --git a/fs/nfs/nfs4state.c b/fs/nfs/nfs4state.c
index 42f121182167..69d12decd04a 100644
--- a/fs/nfs/nfs4state.c
+++ b/fs/nfs/nfs4state.c
@@ -1150,15 +1150,12 @@  static void nfs4_clear_state_manager_bit(struct nfs_client *clp)
 /*
  * Schedule the nfs_client asynchronous state management routine
  */
-void nfs4_schedule_state_manager(struct nfs_client *clp)
+int nfs4_start_state_manager(struct nfs_client *clp)
 {
 	struct task_struct *task;
 	char buf[INET6_ADDRSTRLEN + sizeof("-manager") + 1];
 
-	if (test_and_set_bit(NFS4CLNT_MANAGER_RUNNING, &clp->cl_state) != 0)
-		return;
 	__module_get(THIS_MODULE);
-	atomic_inc(&clp->cl_count);
 
 	/* The rcu_read_lock() is not strictly necessary, as the state
 	 * manager is the only thread that ever changes the rpc_xprt
@@ -1171,10 +1168,18 @@  void nfs4_schedule_state_manager(struct nfs_client *clp)
 	if (IS_ERR(task)) {
 		printk(KERN_ERR "%s: kthread_run: %ld\n",
 			__func__, PTR_ERR(task));
-		nfs4_clear_state_manager_bit(clp);
-		nfs_put_client(clp);
 		module_put(THIS_MODULE);
+		return PTR_ERR(task);
 	}
+	return 0;
+}
+
+void nfs4_schedule_state_manager(struct nfs_client *clp)
+{
+	if (test_and_set_bit(NFS4CLNT_MANAGER_RUNNING, &clp->cl_state) != 0)
+		return;
+	smp_mb__after_atomic();
+	wake_up_bit(&clp->cl_state, NFS4CLNT_MANAGER_RUNNING);
 }
 
 /*
@@ -2328,8 +2333,12 @@  static void nfs4_state_manager(struct nfs_client *clp)
 	int status = 0;
 	const char *section = "", *section_sep = "";
 
-	/* Ensure exclusive access to NFSv4 state */
-	do {
+	while (atomic_read(&clp->cl_count) && clp->cl_state) {
+		/* If we found more work to do, we must ensure the
+		 * bit is set so other code can wait for it.
+		 */
+		set_bit(NFS4CLNT_MANAGER_RUNNING, &clp->cl_state);
+
 		if (test_bit(NFS4CLNT_PURGE_STATE, &clp->cl_state)) {
 			section = "purge state";
 			status = nfs4_purge_lease(clp);
@@ -2423,12 +2432,7 @@  static void nfs4_state_manager(struct nfs_client *clp)
 		}
 
 		nfs4_clear_state_manager_bit(clp);
-		/* Did we race with an attempt to give us more work? */
-		if (clp->cl_state == 0)
-			break;
-		if (test_and_set_bit(NFS4CLNT_MANAGER_RUNNING, &clp->cl_state) != 0)
-			break;
-	} while (atomic_read(&clp->cl_count) > 1);
+	}
 	return;
 out_error:
 	if (strlen(section))
@@ -2444,10 +2448,23 @@  out_error:
 static int nfs4_run_state_manager(void *ptr)
 {
 	struct nfs_client *clp = ptr;
+	struct nfs_net *nn;
 
 	allow_signal(SIGKILL);
-	nfs4_state_manager(clp);
-	nfs_put_client(clp);
+	while (atomic_read(&clp->cl_count)) {
+		if (signal_pending(current))
+			flush_signals(current);
+		wait_event_interruptible(
+			*bit_waitqueue(&clp->cl_state,
+				       NFS4CLNT_MANAGER_RUNNING),
+			test_bit(NFS4CLNT_MANAGER_RUNNING, &clp->cl_state));
+		nfs4_state_manager(clp);
+	}
+	nn = net_generic(clp->cl_net, nfs_net_id);
+	spin_lock(&nn->nfs_client_lock);
+	/* nfs_put_client has definitely stopped using its reference */
+	spin_unlock(&nn->nfs_client_lock);
+	clp->rpc_ops->free_client(clp);
 	module_put_and_exit(0);
 	return 0;
 }
diff --git a/include/linux/nfs_fs_sb.h b/include/linux/nfs_fs_sb.h
index 1150ea41b626..40c7d1a53388 100644
--- a/include/linux/nfs_fs_sb.h
+++ b/include/linux/nfs_fs_sb.h
@@ -36,6 +36,7 @@  struct nfs_client {
 #define NFS_CS_RENEWD		3		/* - renewd started */
 #define NFS_CS_STOP_RENEW	4		/* no more state to renew */
 #define NFS_CS_CHECK_LEASE_TIME	5		/* need to check lease time */
+#define NFS_CS_MANAGER		6		/* NFSv4 manager thread running */
 	unsigned long		cl_flags;	/* behavior switches */
 #define NFS_CS_NORESVPORT	0		/* - use ephemeral src port */
 #define NFS_CS_DISCRTRY		1		/* - disconnect on RPC retry */