
NFS: avoid deadlock in nfs_kill_super

Message ID 1351188163-10067-1-git-send-email-dros@netapp.com (mailing list archive)
State New, archived

Commit Message

Weston Andros Adamson Oct. 25, 2012, 6:02 p.m. UTC
Calling nfs_kill_super from an RPC task callback would result in a deadlock
where nfs_free_server (via rpc_shutdown_client) tries to kill all
RPC tasks associated with that connection - including itself!

Instead of calling nfs_kill_super directly, queue a job on the nfsiod
workqueue.

Signed-off-by: Weston Andros Adamson <dros@netapp.com>
---

This fixes the current incarnation of the lockup I've been tracking down for
some time now.  I still have to go back and see why the reproducer changed
behavior a few weeks ago - tasks used to get stuck in rpc_prepare_task, but
now (before this patch) are stuck in rpc_exit.

The reproducer works against a server with write delegations:

./nfsometer.py -m v4 server:/path dd_100m_100k

which is basically:
 - mount
 - dd if=/dev/zero of=./dd_file.100m_100k bs=102400 count=1024
 - umount
 - break if /proc/fs/nfsfs/servers still has entry after 5 seconds (in this
   case it NEVER goes away)

There are clearly other ways to trigger this deadlock, like a v4.1 CLOSE - the
done handler calls nfs_sb_deactivate...

I've tested this approach with 10 runs X 3 nfs versions X 5 workloads 
(dd_100m_100k, dd_100m_1k, python, kernel, cthon), so I'm pretty confident
it's correct.

One question for the list: should nfs_free_server *always* be scheduled on
the nfsiod workqueue? It's called in error paths in several locations.
After looking at them, I don't think my approach would break anything, but 
some might have style objections.

 -dros

 fs/nfs/client.c           |   20 +++++++++++++++++---
 include/linux/nfs_fs_sb.h |    1 +
 2 files changed, 18 insertions(+), 3 deletions(-)

Comments

Trond Myklebust Oct. 25, 2012, 6:17 p.m. UTC | #1
On Thu, 2012-10-25 at 14:02 -0400, Weston Andros Adamson wrote:
> Calling nfs_kill_super from an RPC task callback would result in a deadlock
> where nfs_free_server (via rpc_shutdown_client) tries to kill all
> RPC tasks associated with that connection - including itself!
>
> Instead of calling nfs_kill_super directly, queue a job on the nfsiod
> workqueue.
>
> Signed-off-by: Weston Andros Adamson <dros@netapp.com>
> ---
>
> This fixes the current incarnation of the lockup I've been tracking down for
> some time now.  I still have to go back and see why the reproducer changed
> behavior a few weeks ago - tasks used to get stuck in rpc_prepare_task, but
> now (before this patch) are stuck in rpc_exit.
>
> The reproducer works against a server with write delegations:
>
> ./nfsometer.py -m v4 server:/path dd_100m_100k
>
> which is basically:
>  - mount
>  - dd if=/dev/zero of=./dd_file.100m_100k bs=102400 count=1024
>  - umount
>  - break if /proc/fs/nfsfs/servers still has entry after 5 seconds (in this
>    case it NEVER goes away)
>
> There are clearly other ways to trigger this deadlock, like a v4.1 CLOSE - the
> done handler calls nfs_sb_deactivate...
>
> I've tested this approach with 10 runs X 3 nfs versions X 5 workloads
> (dd_100m_100k, dd_100m_1k, python, kernel, cthon), so I'm pretty confident
> it's correct.
>
> One question for the list: should nfs_free_server *always* be scheduled on
> the nfsiod workqueue? It's called in error paths in several locations.
> After looking at them, I don't think my approach would break anything, but
> some might have style objections.
>

This doesn't add up. There should be nothing calling nfs_sb_deactive()
from a rpc_call_done() callback. If so, then that would be the bug.

All calls to things like rpc_put_task(), put_nfs_open_context(), dput(),
or nfs_sb_deactive() should occur in the rpc_call_release() callback if
they can't be done in a process context. In both those cases, the
rpc_task will be invisible to rpc_killall_tasks and rpc_shutdown_client.
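
The split Trond describes maps onto the struct rpc_call_ops callbacks used by
async SUNRPC calls. The following is only a minimal sketch of that pattern -
my_calldata, my_call_done, my_call_release and my_call_ops are hypothetical
names for illustration, not the actual NFS close or delegreturn code:

#include <linux/kernel.h>
#include <linux/slab.h>
#include <linux/dcache.h>
#include <linux/nfs_fs.h>
#include <linux/sunrpc/sched.h>

/* Hypothetical per-call state passed to rpc_run_task() as calldata. */
struct my_calldata {
	struct nfs_open_context *ctx;
	struct dentry *dentry;
};

/*
 * rpc_call_done runs while the task is still on the client's task list,
 * i.e. still visible to rpc_killall_tasks()/rpc_shutdown_client(), so it
 * should only examine the result - never drop superblock references here.
 */
static void my_call_done(struct rpc_task *task, void *calldata)
{
	if (task->tk_status < 0)
		pr_debug("%s: call failed: %d\n", __func__, task->tk_status);
}

/*
 * rpc_release runs after the task has been removed from the task list,
 * so references whose release may tear down the mount are dropped here.
 */
static void my_call_release(void *calldata)
{
	struct my_calldata *d = calldata;

	put_nfs_open_context(d->ctx);
	dput(d->dentry);
	kfree(d);
}

static const struct rpc_call_ops my_call_ops = {
	.rpc_call_done	= my_call_done,
	.rpc_release	= my_call_release,
};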

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com
Adamson, Dros Oct. 25, 2012, 7:22 p.m. UTC | #2
On Oct 25, 2012, at 2:17 PM, "Myklebust, Trond" <Trond.Myklebust@netapp.com> wrote:

> On Thu, 2012-10-25 at 14:02 -0400, Weston Andros Adamson wrote:
>> Calling nfs_kill_super from an RPC task callback would result in a deadlock
>> where nfs_free_server (via rpc_shutdown_client) tries to kill all
>> RPC tasks associated with that connection - including itself!
>> 
>> Instead of calling nfs_kill_super directly, queue a job on the nfsiod
>> workqueue.
>> 
>> Signed-off-by: Weston Andros Adamson <dros@netapp.com>
>> ---
>> 
>> This fixes the current incarnation of the lockup I've been tracking down for
>> some time now.  I still have to go back and see why the reproducer changed
>> behavior a few weeks ago - tasks used to get stuck in rpc_prepare_task, but
>> now (before this patch) are stuck in rpc_exit.
>> 
>> The reproducer works against a server with write delegations:
>> 
>> ./nfsometer.py -m v4 server:/path dd_100m_100k
>> 
>> which is basically:
>> - mount
>> - dd if=/dev/zero of=./dd_file.100m_100k bs=102400 count=1024
>> - umount
>> - break if /proc/fs/nfsfs/servers still has entry after 5 seconds (in this
>>   case it NEVER goes away)
>> 
>> There are clearly other ways to trigger this deadlock, like a v4.1 CLOSE - the
>> done handler calls nfs_sb_deactivate...
>> 
>> I've tested this approach with 10 runs X 3 nfs versions X 5 workloads 
>> (dd_100m_100k, dd_100m_1k, python, kernel, cthon), so I'm pretty confident
>> it's correct.
>> 
>> One question for the list: should nfs_free_server *always* be scheduled on
>> the nfsiod workqueue? It's called in error paths in several locations.
>> After looking at them, I don't think my approach would break anything, but 
>> some might have style objections.
>> 
> 
> This doesn't add up. There should be nothing calling nfs_sb_deactive()
> from a rpc_call_done() callback. If so, then that would be the bug.
> 
> All calls to things like rpc_put_task(), put_nfs_open_context(), dput(),
> or nfs_sb_deactive() should occur in the rpc_call_release() callback if
> they can't be done in a process context. In both those cases, the
> rpc_task will be invisible to rpc_killall_tasks and rpc_shutdown_client.

Ah, I misunderstood what was going on here.

nfs_kill_super *is* being called from the rpc_release_calldata callback.

Here's the kworker stuck in rpc_killall_tasks forever:

[   34.552600]  [<ffffffffa00868e6>] rpc_killall_tasks+0x2d/0xcd [sunrpc]
[   34.552608]  [<ffffffffa00883e4>] rpc_shutdown_client+0x4a/0xec [sunrpc]
[   34.552615]  [<ffffffffa002b973>] nfs_free_server+0xcf/0x133 [nfs]
[   34.552625]  [<ffffffffa0033193>] nfs_kill_super+0x37/0x3c [nfs]
[   34.552629]  [<ffffffff81136c68>] deactivate_locked_super+0x37/0x63
[   34.552633]  [<ffffffff8113785f>] deactivate_super+0x37/0x3b
[   34.552642]  [<ffffffffa0034fc1>] nfs_sb_deactive+0x23/0x25 [nfs]
[   34.552649]  [<ffffffffa00dbba2>] nfs4_free_closedata+0x53/0x63 [nfsv4]
[   34.552661]  [<ffffffffa008f997>] rpc_release_calldata+0x17/0x19 [sunrpc]
[   34.552671]  [<ffffffffa008f9f5>] rpc_free_task+0x5c/0x65 [sunrpc]
[   34.552680]  [<ffffffffa008fe07>] rpc_async_release+0x15/0x17 [sunrpc]
[   34.552684]  [<ffffffff810632b7>] process_one_work+0x192/0x2a0
[   34.552693]  [<ffffffffa008fdf2>] ? rpc_async_schedule+0x33/0x33 [sunrpc]
[   34.552697]  [<ffffffff81064169>] worker_thread+0x140/0x1d7
[   34.552700]  [<ffffffff81064029>] ? manage_workers+0x23b/0x23b
[   34.552704]  [<ffffffff81067d21>] kthread+0x8d/0x95
[   34.552708]  [<ffffffff81067c94>] ? kthread_freezable_should_stop+0x43/0x43
[   34.552713]  [<ffffffff814ef1ac>] ret_from_fork+0x7c/0xb0
[   34.552717]  [<ffffffff81067c94>] ? kthread_freezable_should_stop+0x43/0x43

And the client's task list:

[  174.574006] -pid- flgs status -client- --rqstp- -timeout ---ops--
[  174.574019]  1664 0181     -5 ffff880226474600   (null)        0 ffffffffa00f7ce0 nfsv4 DELEGRETURN a:rpc_exit_task [sunrpc] q:none

So it looks like a CLOSE's rpc_release_calldata is triggering nfs_kill_super, and that path is stuck trying to kill the DELEGRETURN task - which never gets run.

I've debugged this from the workqueue side: the DELEGRETURN work is scheduled, but ends up having insert_wq_barrier called on it.  It seems to me that the workqueue is enforcing an ordering requirement that the CLOSE work must complete before the DELEGRETURN work can proceed -- and *that* is the deadlock (CLOSE waits until DELEGRETURN is dead, while DELEGRETURN can't run until CLOSE is complete).

This would also explain our (Trond's and my) failed attempts at canceling / rescheduling jobs in rpc_killall_tasks -- insert_wq_barrier's comment states:

 * Currently, a queued barrier can't be canceled.  This is because
 * try_to_grab_pending() can't determine whether the work to be
 * grabbed is at the head of the queue and thus can't clear LINKED
 * flag of the previous work while there must be a valid next work
 * after a work with LINKED flag set.
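
Stripped of the NFS specifics, the shape of the deadlock is a work item waiting
on another work item queued behind it on the same ordered workqueue. Here is a
self-contained sketch of that generic pattern as a hypothetical demo module -
the names, the ordered queue and the flush_work() call are illustrative, this
is not the rpciod setup:

#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/workqueue.h>

static struct workqueue_struct *demo_wq;	/* ordered: one item at a time */
static struct work_struct close_work, delegreturn_work;

static void delegreturn_fn(struct work_struct *work)
{
	/* Never starts while close_fn() is still running. */
	pr_info("delegreturn work ran\n");
}

static void close_fn(struct work_struct *work)
{
	/*
	 * delegreturn_work sits behind us on the same ordered queue.
	 * flush_work() inserts a barrier and waits for it to complete,
	 * but it can't start until close_fn() returns: deadlock.
	 */
	flush_work(&delegreturn_work);
}

static int __init demo_init(void)
{
	demo_wq = alloc_ordered_workqueue("wq_deadlock_demo", 0);
	if (!demo_wq)
		return -ENOMEM;

	INIT_WORK(&close_work, close_fn);
	INIT_WORK(&delegreturn_work, delegreturn_fn);

	queue_work(demo_wq, &close_work);
	queue_work(demo_wq, &delegreturn_work);
	return 0;
}
module_init(demo_init);

static void __exit demo_exit(void)
{
	destroy_workqueue(demo_wq);
}
module_exit(demo_exit);

MODULE_LICENSE("GPL");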

Now that I have a better understanding of what's happening, I'll go back to the drawing board.

Thanks!
-dros

Patch

diff --git a/fs/nfs/client.c b/fs/nfs/client.c
index c285e0a..9186a96 100644
--- a/fs/nfs/client.c
+++ b/fs/nfs/client.c
@@ -1010,9 +1010,11 @@  EXPORT_SYMBOL_GPL(nfs_alloc_server);
 /*
  * Free up a server record
  */
-void nfs_free_server(struct nfs_server *server)
+static void nfs_free_server_schedule_work(struct work_struct *work)
 {
-	dprintk("--> nfs_free_server()\n");
+	struct nfs_server *server = container_of(work, struct nfs_server, work);
+
+	dprintk("--> %s\n", __func__);
 
 	nfs_server_remove_lists(server);
 
@@ -1032,7 +1034,19 @@  void nfs_free_server(struct nfs_server *server)
 	bdi_destroy(&server->backing_dev_info);
 	kfree(server);
 	nfs_release_automount_timer();
-	dprintk("<-- nfs_free_server()\n");
+	dprintk("<-- %s\n", __func__);
+}
+
+/*
+ * Queue work on nfsiod workqueue to free up a server record.
+ * This avoids a deadlock when an RPC task scheduled from the rpciod
+ * workqueue tries to kill itself.
+ */
+void nfs_free_server(struct nfs_server *server)
+{
+	WARN_ON_ONCE(work_pending(&server->work));
+	INIT_WORK(&server->work, nfs_free_server_schedule_work);
+	queue_work(nfsiod_workqueue, &server->work);
 }
 EXPORT_SYMBOL_GPL(nfs_free_server);
 
diff --git a/include/linux/nfs_fs_sb.h b/include/linux/nfs_fs_sb.h
index a9e76ee..a607886 100644
--- a/include/linux/nfs_fs_sb.h
+++ b/include/linux/nfs_fs_sb.h
@@ -171,6 +171,7 @@  struct nfs_server {
 	void (*destroy)(struct nfs_server *);
 
 	atomic_t active; /* Keep trace of any activity to this server */
+	struct work_struct	work;		/* used to schedule free */
 
 	/* mountd-related mount options */
 	struct sockaddr_storage	mountd_address;