
panic on 4.0 server exporting xfs filesystem

Message ID 20150305131731.GA16235@lst.de (mailing list archive)
State New, archived

Commit Message

Christoph Hellwig March 5, 2015, 1:17 p.m. UTC
On Wed, Mar 04, 2015 at 11:08:49PM -0500, J. Bruce Fields wrote:
> Ah-hah:
> 
> 	static void
> 	nfsd4_cb_layout_fail(struct nfs4_layout_stateid *ls)
> 	{
> 		...
> 		nfsd4_cb_layout_fail(ls);
> 
> That'd do it!
> 
> Haven't tried to figure out why exactly that's getting called, and why
> only rarely.  Some intermittent problem with the callback path, I guess.
> 
> Anyway, I think that solves most of the mystery....

Ooops, that was a nasty git merge error in the last rebase, see the fix
below.  But I really wonder if we need to make the usage of pnfs explicit
after all, otherwise we'll always hand out layouts on any XFS-exported
filesystems, which can't be used and will eventually need to be recalled.
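
To make the failure mode concrete, the stray call amounts to this (a
distilled sketch with stand-in names, not the actual nfsd code):

	/* No base case: each recall failure pushes another stack frame
	 * until the fixed-size kernel stack overflows and the machine
	 * panics.
	 */
	struct layout_stateid;		/* stand-in for nfs4_layout_stateid */

	static void
	cb_layout_fail(struct layout_stateid *ls)
	{
		cb_layout_fail(ls);	/* unconditional self-call, never returns */
		/* the real fencing/printk work here is unreachable */
	}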

---
From ad592590cce9f7441c3cd21d030f3a986d8759d7 Mon Sep 17 00:00:00 2001
From: Christoph Hellwig <hch@lst.de>
Date: Thu, 5 Mar 2015 06:12:29 -0700
Subject: nfsd: don't recursively call nfsd4_cb_layout_fail

Due to a merge error when creating c5c707f9 ("nfsd: implement pNFS
layout recalls"), we recursively call nfsd4_cb_layout_fail from itself,
leading to stack overflows.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/nfsd/nfs4layouts.c | 2 --
 1 file changed, 2 deletions(-)

Comments

J. Bruce Fields March 5, 2015, 3:01 p.m. UTC | #1
On Thu, Mar 05, 2015 at 02:17:31PM +0100, Christoph Hellwig wrote:
> On Wed, Mar 04, 2015 at 11:08:49PM -0500, J. Bruce Fields wrote:
> > Ah-hah:
> > 
> > 	static void
> > 	nfsd4_cb_layout_fail(struct nfs4_layout_stateid *ls)
> > 	{
> > 		...
> > 		nfsd4_cb_layout_fail(ls);
> > 
> > That'd do it!
> > 
> > Haven't tried to figure out why exactly that's getting called, and why
> > only rarely.  Some intermittent problem with the callback path, I guess.
> > 
> > Anyway, I think that solves most of the mystery....
> 
> Ooops, that was a nasty git merge error in the last rebase, see the fix
> below.

Thanks!

> But I really wonder if we need to make the usage of pnfs explicit
> after all, otherwise we'll always hand out layouts on any XFS-exported
> filesystems, which can't be used and will eventually need to be recalled.

Yeah, maybe.  We could check how many LAYOUTGETs we're actually seeing
on these tests.  In theory the client could quit asking, or only ask
every n seconds, if the layouts it gets are all turning out to be
useless.

--b.

> 
> ---
> From ad592590cce9f7441c3cd21d030f3a986d8759d7 Mon Sep 17 00:00:00 2001
> From: Christoph Hellwig <hch@lst.de>
> Date: Thu, 5 Mar 2015 06:12:29 -0700
> Subject: nfsd: don't recursively call nfsd4_cb_layout_fail
> 
> Due to a merge error when creating c5c707f9 ("nfsd: implement pNFS
> layout recalls"), we recursively call nfsd4_cb_layout_fail from itself,
> leading to stack overflows.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/nfsd/nfs4layouts.c | 2 --
>  1 file changed, 2 deletions(-)
> 
> diff --git a/fs/nfsd/nfs4layouts.c b/fs/nfsd/nfs4layouts.c
> index 3c1bfa1..1028a06 100644
> --- a/fs/nfsd/nfs4layouts.c
> +++ b/fs/nfsd/nfs4layouts.c
> @@ -587,8 +587,6 @@ nfsd4_cb_layout_fail(struct nfs4_layout_stateid *ls)
>  
>  	rpc_ntop((struct sockaddr *)&clp->cl_addr, addr_str, sizeof(addr_str));
>  
> -	nfsd4_cb_layout_fail(ls);
> -
>  	printk(KERN_WARNING
>  		"nfsd: client %s failed to respond to layout recall. "
>  		"  Fencing..\n", addr_str);
> -- 
> 1.9.1
J. Bruce Fields March 5, 2015, 5:02 p.m. UTC | #2
On Thu, Mar 05, 2015 at 10:01:38AM -0500, J. Bruce Fields wrote:
> On Thu, Mar 05, 2015 at 02:17:31PM +0100, Christoph Hellwig wrote:
> > On Wed, Mar 04, 2015 at 11:08:49PM -0500, J. Bruce Fields wrote:
> > > Ah-hah:
> > > 
> > > 	static void
> > > 	nfsd4_cb_layout_fail(struct nfs4_layout_stateid *ls)
> > > 	{
> > > 		...
> > > 		nfsd4_cb_layout_fail(ls);
> > > 
> > > That'd do it!
> > > 
> > > Haven't tried to figure out why exactly that's getting called, and why
> > > only rarely.  Some intermittent problem with the callback path, I guess.
> > > 
> > > Anyway, I think that solves most of the mystery....
> > 
> > Ooops, that was a nasty git merge error in the last rebase, see the fix
> > below.
> 
> Thanks!

And with that fix things look good.

I'm still curious why the callbacks are failing.  It's also logging
"nfsd: client 192.168.122.32 failed to respond to layout recall".

I may not get a chance to debug for another week or two.

--b.
J. Bruce Fields March 5, 2015, 8:47 p.m. UTC | #3
On Thu, Mar 05, 2015 at 12:02:17PM -0500, J. Bruce Fields wrote:
> On Thu, Mar 05, 2015 at 10:01:38AM -0500, J. Bruce Fields wrote:
> > On Thu, Mar 05, 2015 at 02:17:31PM +0100, Christoph Hellwig wrote:
> > > On Wed, Mar 04, 2015 at 11:08:49PM -0500, J. Bruce Fields wrote:
> > > > Ah-hah:
> > > > 
> > > > 	static void
> > > > 	nfsd4_cb_layout_fail(struct nfs4_layout_stateid *ls)
> > > > 	{
> > > > 		...
> > > > 		nfsd4_cb_layout_fail(ls);
> > > > 
> > > > That'd do it!
> > > > 
> > > > Haven't tried to figure out why exactly that's getting called, and why
> > > > only rarely.  Some intermittent problem with the callback path, I guess.
> > > > 
> > > > Anyway, I think that solves most of the mystery....
> > > 
> > > Ooops, that was a nasty git merge error in the last rebase, see the fix
> > > below.
> > 
> > Thanks!
> 
> And with that fix things look good.
> 
> I'm still curious why the callbacks are failing.  It's also logging
> "nfsd: client 192.168.122.32 failed to respond to layout recall".

I spoke too soon; I'm still not getting through my usual test run--the most
recent run is hanging in generic/247 with the following in the server logs.

But I probably still won't get a chance to look at this any closer till after
vault.

--b.

nfsd: client 192.168.122.32 failed to respond to layout recall.   Fencing..
nfsd: fence failed for client 192.168.122.32: -2!
nfsd: client 192.168.122.32 failed to respond to layout recall.   Fencing..
nfsd: fence failed for client 192.168.122.32: -2!
receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt ffff88005639a000 xid c21abd62
kswapd0: page allocation failure: order:0, mode:0x120
CPU: 0 PID: 580 Comm: kswapd0 Not tainted 4.0.0-rc2-09922-g26cbcc7 #89
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140709_153950- 04/01/2014
 0000000000000000 ffff88007f803bc8 ffffffff81a92597 0000000080000101
 0000000000000120 ffff88007f803c58 ffffffff81155322 ffffffff822e9448
 0000000000000000 0000000000000000 0000000000000001 00000000ffffffff
Call Trace:
 <IRQ>  [<ffffffff81a92597>] dump_stack+0x4f/0x7b
 [<ffffffff81155322>] warn_alloc_failed+0xe2/0x130
 [<ffffffff81158503>] __alloc_pages_nodemask+0x5f3/0x860
 [<ffffffff8195b2ab>] skb_page_frag_refill+0xab/0xd0
 [<ffffffff817b5fba>] try_fill_recv+0x38a/0x7d0
 [<ffffffff817b6af2>] virtnet_receive+0x6f2/0x860
 [<ffffffff817b6c81>] virtnet_poll+0x21/0x90
 [<ffffffff8197344a>] net_rx_action+0x1aa/0x320
 [<ffffffff810796f2>] __do_softirq+0xf2/0x370
 [<ffffffff81079ad5>] irq_exit+0x65/0x70
 [<ffffffff81a9e124>] do_IRQ+0x64/0x100
 [<ffffffff811d56f2>] ? free_buffer_head+0x22/0x80
 [<ffffffff81a9c16f>] common_interrupt+0x6f/0x6f
 <EOI>  [<ffffffff81092224>] ? kernel_text_address+0x64/0x70
 [<ffffffff81197ff7>] ? kmem_cache_free+0xb7/0x230
 [<ffffffff811d56f2>] free_buffer_head+0x22/0x80
 [<ffffffff811d5b19>] try_to_free_buffers+0x79/0xc0
 [<ffffffff81394b1b>] xfs_vm_releasepage+0x6b/0x160
 [<ffffffff81184dc0>] ? __page_check_address+0x1f0/0x1f0
 [<ffffffff8114fcc5>] try_to_release_page+0x35/0x60
 [<ffffffff81162e02>] shrink_page_list+0x9c2/0xba0
 [<ffffffff810bba1d>] ? trace_hardirqs_on_caller+0x12d/0x1d0
 [<ffffffff811635df>] shrink_inactive_list+0x23f/0x520
 [<ffffffff81164155>] shrink_lruvec+0x595/0x6d0
 [<ffffffff81161a39>] ? shrink_slab.part.58.constprop.67+0x299/0x410
 [<ffffffff811642dd>] shrink_zone+0x4d/0xa0
 [<ffffffff81164c01>] kswapd+0x471/0xa30
 [<ffffffff81164790>] ? try_to_free_pages+0x460/0x460
 [<ffffffff81093a2f>] kthread+0xef/0x110
 [<ffffffff81093940>] ? kthread_create_on_node+0x220/0x220
 [<ffffffff81a9b5ac>] ret_from_fork+0x7c/0xb0
 [<ffffffff81093940>] ? kthread_create_on_node+0x220/0x220
Mem-Info:
DMA per-cpu:
CPU    0: hi:    0, btch:   1 usd:   0
CPU    1: hi:    0, btch:   1 usd:   0
DMA32 per-cpu:
CPU    0: hi:  186, btch:  31 usd:   0
CPU    1: hi:  186, btch:  31 usd:  64
active_anon:7053 inactive_anon:2435 isolated_anon:0
 active_file:88743 inactive_file:89505 isolated_file:32
 unevictable:0 dirty:9786 writeback:0 unstable:0
 free:3571 slab_reclaimable:227807 slab_unreclaimable:75772
 mapped:21010 shmem:380 pagetables:1567 bounce:0
 free_cma:0
DMA free:7788kB min:44kB low:52kB high:64kB active_anon:204kB inactive_anon:260kB active_file:696kB inactive_file:1016kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15992kB managed:15908kB mlocked:0kB dirty:32kB writeback:0kB mapped:924kB shmem:36kB slab_reclaimable:1832kB slab_unreclaimable:1632kB kernel_stack:48kB pagetables:80kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 1943 1943 1943
DMA32 free:6188kB min:5616kB low:7020kB high:8424kB active_anon:28008kB inactive_anon:9480kB active_file:354360kB inactive_file:357004kB unevictable:0kB isolated(anon):0kB isolated(file):128kB present:2080640kB managed:1990060kB mlocked:0kB dirty:39112kB writeback:0kB mapped:83116kB shmem:1484kB slab_reclaimable:909396kB slab_unreclaimable:301708kB kernel_stack:2608kB pagetables:6188kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
DMA: 21*4kB (UEM) 11*8kB (UM) 10*16kB (UEM) 3*32kB (UM) 1*64kB (E) 1*128kB (E) 2*256kB (UM) 1*512kB (M) 2*1024kB (UM) 2*2048kB (ER) 0*4096kB = 7788kB
DMA32: 276*4kB (UM) 30*8kB (UM) 83*16kB (UM) 2*32kB (MR) 2*64kB (M) 2*128kB (R) 2*256kB (R) 1*512kB (R) 0*1024kB 1*2048kB (R) 0*4096kB = 6192kB
Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
178684 total pagecache pages
1 pages in swap cache
Swap cache stats: add 14, delete 13, find 0/0
Free swap  = 839620kB
Total swap = 839676kB
524158 pages RAM
0 pages HighMem/MovableOnly
22666 pages reserved
nfsd: client 192.168.122.32 failed to respond to layout recall.   Fencing..
nfsd: fence failed for client 192.168.122.32: -2!
receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt ffff880051dfc000 xid 8ff02aaf
INFO: task nfsd:17653 blocked for more than 120 seconds.
      Not tainted 4.0.0-rc2-09922-g26cbcc7 #89
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
nfsd            D ffff8800753a7848 11720 17653      2 0x00000000
 ffff8800753a7848 0000000000000001 0000000000000001 ffffffff82210580
 ffff88004e9bcb50 0000000000000006 ffffffff8119d9f0 ffff8800753a7848
 ffff8800753a7fd8 ffff88002e5e3d70 0000000000000246 ffff88004e9bcb50
Call Trace:
 [<ffffffff8119d9f0>] ? new_sync_read+0xb0/0xb0
 [<ffffffff8119d9f0>] ? new_sync_read+0xb0/0xb0
 [<ffffffff81a95737>] schedule+0x37/0x90
 [<ffffffff81a95ac8>] schedule_preempt_disabled+0x18/0x30
 [<ffffffff81a97756>] mutex_lock_nested+0x156/0x400
 [<ffffffff813a0d5a>] ? xfs_file_buffered_aio_write.isra.9+0x6a/0x2a0
 [<ffffffff8119d9f0>] ? new_sync_read+0xb0/0xb0
 [<ffffffff813a0d5a>] xfs_file_buffered_aio_write.isra.9+0x6a/0x2a0
 [<ffffffff8119d9f0>] ? new_sync_read+0xb0/0xb0
 [<ffffffff813a1016>] xfs_file_write_iter+0x86/0x130
 [<ffffffff8119db05>] do_iter_readv_writev+0x65/0xa0
 [<ffffffff8119f0b2>] do_readv_writev+0xe2/0x280
 [<ffffffff813a0f90>] ? xfs_file_buffered_aio_write.isra.9+0x2a0/0x2a0
 [<ffffffff813a0f90>] ? xfs_file_buffered_aio_write.isra.9+0x2a0/0x2a0
 [<ffffffff810bc15d>] ? __lock_acquire+0x37d/0x2040
 [<ffffffff810be759>] ? lock_release_non_nested+0xa9/0x340
 [<ffffffff81a9aaad>] ? _raw_spin_unlock_irqrestore+0x5d/0x80
 [<ffffffff810bba1d>] ? trace_hardirqs_on_caller+0x12d/0x1d0
 [<ffffffff810bbacd>] ? trace_hardirqs_on+0xd/0x10
 [<ffffffff8119f2d9>] vfs_writev+0x39/0x50
 [<ffffffffa00ab861>] nfsd_vfs_write.isra.11+0xa1/0x350 [nfsd]
 [<ffffffffa00aecee>] nfsd_write+0x8e/0x100 [nfsd]
 [<ffffffffa00b95b5>] ? gen_boot_verifier+0x5/0xc0 [nfsd]
 [<ffffffffa00b97f5>] nfsd4_write+0x185/0x1e0 [nfsd]
 [<ffffffffa00bbe37>] nfsd4_proc_compound+0x3c7/0x6f0 [nfsd]
 [<ffffffffa00a7463>] nfsd_dispatch+0xc3/0x220 [nfsd]
 [<ffffffffa001314f>] svc_process_common+0x43f/0x650 [sunrpc]
 [<ffffffffa00134a3>] svc_process+0x143/0x260 [sunrpc]
 [<ffffffffa00a6cc7>] nfsd+0x167/0x1e0 [nfsd]
 [<ffffffffa00a6b65>] ? nfsd+0x5/0x1e0 [nfsd]
 [<ffffffffa00a6b60>] ? nfsd_destroy+0xe0/0xe0 [nfsd]
 [<ffffffff81093a2f>] kthread+0xef/0x110
 [<ffffffff81093940>] ? kthread_create_on_node+0x220/0x220
 [<ffffffff81a9b5ac>] ret_from_fork+0x7c/0xb0
 [<ffffffff81093940>] ? kthread_create_on_node+0x220/0x220
2 locks held by nfsd/17653:
 #0:  (sb_writers#11){.+.+.+}, at: [<ffffffff8119f208>] do_readv_writev+0x238/0x280
 #1:  (&sb->s_type->i_mutex_key#15){+.+.+.}, at: [<ffffffff813a0d5a>] xfs_file_buffered_aio_write.isra.9+0x6a/0x2a0
INFO: task nfsd:17655 blocked for more than 120 seconds.
...
Dave Chinner March 5, 2015, 8:59 p.m. UTC | #4
On Thu, Mar 05, 2015 at 03:47:49PM -0500, J. Bruce Fields wrote:
> On Thu, Mar 05, 2015 at 12:02:17PM -0500, J. Bruce Fields wrote:
> > On Thu, Mar 05, 2015 at 10:01:38AM -0500, J. Bruce Fields wrote:
> > > On Thu, Mar 05, 2015 at 02:17:31PM +0100, Christoph Hellwig wrote:
> > > > On Wed, Mar 04, 2015 at 11:08:49PM -0500, J. Bruce Fields wrote:
> > > > > Ah-hah:
> > > > > 
> > > > > 	static void
> > > > > 	nfsd4_cb_layout_fail(struct nfs4_layout_stateid *ls)
> > > > > 	{
> > > > > 		...
> > > > > 		nfsd4_cb_layout_fail(ls);
> > > > > 
> > > > > That'd do it!
> > > > > 
> > > > > Haven't tried to figure out why exactly that's getting called, and why
> > > > > only rarely.  Some intermittent problem with the callback path, I guess.
> > > > > 
> > > > > Anyway, I think that solves most of the mystery....
> > > > 
> > > > Ooops, that was a nasty git merge error in the last rebase, see the fix
> > > > below.
> > > 
> > > Thanks!
> > 
> > And with that fix things look good.
> > 
> > I'm still curious why the callbacks are failing.  It's also logging
> > "nfsd: client 192.168.122.32 failed to respond to layout recall".
> 
> I spoke too soon; I'm still not getting through my usual test run--the most
> recent run is hanging in generic/247 with the following in the server logs.
> 
> But I probably still won't get a chance to look at this any closer till after
> vault.
> 
> --b.
> 
> nfsd: client 192.168.122.32 failed to respond to layout recall.   Fencing..
> nfsd: fence failed for client 192.168.122.32: -2!
> nfsd: client 192.168.122.32 failed to respond to layout recall.   Fencing..
> nfsd: fence failed for client 192.168.122.32: -2!
> receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt ffff88005639a000 xid c21abd62
> kswapd0: page allocation failure: order:0, mode:0x120

[snip network driver memory allocation failure]

> active_anon:7053 inactive_anon:2435 isolated_anon:0
>  active_file:88743 inactive_file:89505 isolated_file:32
>  unevictable:0 dirty:9786 writeback:0 unstable:0
>  free:3571 slab_reclaimable:227807 slab_unreclaimable:75772
>  mapped:21010 shmem:380 pagetables:1567 bounce:0
>  free_cma:0

Looks like there should be heaps of reclaimable memory...

> nfsd: client 192.168.122.32 failed to respond to layout recall.   Fencing..

So there's a layout recall pending...

> nfsd: fence failed for client 192.168.122.32: -2!
> receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt ffff880051dfc000 xid 8ff02aaf
> INFO: task nfsd:17653 blocked for more than 120 seconds.
>       Not tainted 4.0.0-rc2-09922-g26cbcc7 #89
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> nfsd            D ffff8800753a7848 11720 17653      2 0x00000000
>  ffff8800753a7848 0000000000000001 0000000000000001 ffffffff82210580
>  ffff88004e9bcb50 0000000000000006 ffffffff8119d9f0 ffff8800753a7848
>  ffff8800753a7fd8 ffff88002e5e3d70 0000000000000246 ffff88004e9bcb50
> Call Trace:
>  [<ffffffff8119d9f0>] ? new_sync_read+0xb0/0xb0
>  [<ffffffff8119d9f0>] ? new_sync_read+0xb0/0xb0
>  [<ffffffff81a95737>] schedule+0x37/0x90
>  [<ffffffff81a95ac8>] schedule_preempt_disabled+0x18/0x30
>  [<ffffffff81a97756>] mutex_lock_nested+0x156/0x400
>  [<ffffffff813a0d5a>] ? xfs_file_buffered_aio_write.isra.9+0x6a/0x2a0
>  [<ffffffff8119d9f0>] ? new_sync_read+0xb0/0xb0
>  [<ffffffff813a0d5a>] xfs_file_buffered_aio_write.isra.9+0x6a/0x2a0
>  [<ffffffff8119d9f0>] ? new_sync_read+0xb0/0xb0
>  [<ffffffff813a1016>] xfs_file_write_iter+0x86/0x130
>  [<ffffffff8119db05>] do_iter_readv_writev+0x65/0xa0

and the nfsd got hung up on the inode mutex during a write.

Which means there's some other process blocked holding the i_mutex.
sysrq-w and sysrq-t are probably going to tell us more here.

I suspect we'll have another write stuck in break_layout().....

Cheers,

Dave.
J. Bruce Fields March 6, 2015, 8:47 p.m. UTC | #5
On Fri, Mar 06, 2015 at 07:59:22AM +1100, Dave Chinner wrote:
> On Thu, Mar 05, 2015 at 03:47:49PM -0500, J. Bruce Fields wrote:
> > On Thu, Mar 05, 2015 at 12:02:17PM -0500, J. Bruce Fields wrote:
> > > On Thu, Mar 05, 2015 at 10:01:38AM -0500, J. Bruce Fields wrote:
> > > > On Thu, Mar 05, 2015 at 02:17:31PM +0100, Christoph Hellwig wrote:
> > > > > On Wed, Mar 04, 2015 at 11:08:49PM -0500, J. Bruce Fields wrote:
> > > > > > Ah-hah:
> > > > > > 
> > > > > > 	static void
> > > > > > 	nfsd4_cb_layout_fail(struct nfs4_layout_stateid *ls)
> > > > > > 	{
> > > > > > 		...
> > > > > > 		nfsd4_cb_layout_fail(ls);
> > > > > > 
> > > > > > That'd do it!
> > > > > > 
> > > > > > Haven't tried to figure out why exactly that's getting called, and why
> > > > > > only rarely.  Some intermittent problem with the callback path, I guess.
> > > > > > 
> > > > > > Anyway, I think that solves most of the mystery....
> > > > > 
> > > > > Ooops, that was a nasty git merge error in the last rebase, see the fix
> > > > > below.
> > > > 
> > > > Thanks!
> > > 
> > > And with that fix things look good.
> > > 
> > > I'm still curious why the callbacks are failing.  It's also logging
> > > "nfsd: client 192.168.122.32 failed to respond to layout recall".
> > 
> > I spoke too soon; I'm still not getting through my usual test run--the most
> > recent run is hanging in generic/247 with the following in the server logs.
> > 
> > But I probably still won't get a chance to look at this any closer till after
> > vault.
> > 
> > --b.
> > 
> > nfsd: client 192.168.122.32 failed to respond to layout recall.   Fencing..
> > nfsd: fence failed for client 192.168.122.32: -2!
> > nfsd: client 192.168.122.32 failed to respond to layout recall.   Fencing..
> > nfsd: fence failed for client 192.168.122.32: -2!
> > receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt ffff88005639a000 xid c21abd62
> > kswapd0: page allocation failure: order:0, mode:0x120
> 
> [snip network driver memory allocation failure]
> 
> > active_anon:7053 inactive_anon:2435 isolated_anon:0
> >  active_file:88743 inactive_file:89505 isolated_file:32
> >  unevictable:0 dirty:9786 writeback:0 unstable:0
> >  free:3571 slab_reclaimable:227807 slab_unreclaimable:75772
> >  mapped:21010 shmem:380 pagetables:1567 bounce:0
> >  free_cma:0
> 
> Looks like there should be heaps of reclaimable memory...
> 
> > nfsd: client 192.168.122.32 failed to respond to layout recall.   Fencing..
> 
> So there's a layout recall pending...
> 
> > nfsd: fence failed for client 192.168.122.32: -2!
> > receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt ffff880051dfc000 xid 8ff02aaf
> > INFO: task nfsd:17653 blocked for more than 120 seconds.
> >       Not tainted 4.0.0-rc2-09922-g26cbcc7 #89
> > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > nfsd            D ffff8800753a7848 11720 17653      2 0x00000000
> >  ffff8800753a7848 0000000000000001 0000000000000001 ffffffff82210580
> >  ffff88004e9bcb50 0000000000000006 ffffffff8119d9f0 ffff8800753a7848
> >  ffff8800753a7fd8 ffff88002e5e3d70 0000000000000246 ffff88004e9bcb50
> > Call Trace:
> >  [<ffffffff8119d9f0>] ? new_sync_read+0xb0/0xb0
> >  [<ffffffff8119d9f0>] ? new_sync_read+0xb0/0xb0
> >  [<ffffffff81a95737>] schedule+0x37/0x90
> >  [<ffffffff81a95ac8>] schedule_preempt_disabled+0x18/0x30
> >  [<ffffffff81a97756>] mutex_lock_nested+0x156/0x400
> >  [<ffffffff813a0d5a>] ? xfs_file_buffered_aio_write.isra.9+0x6a/0x2a0
> >  [<ffffffff8119d9f0>] ? new_sync_read+0xb0/0xb0
> >  [<ffffffff813a0d5a>] xfs_file_buffered_aio_write.isra.9+0x6a/0x2a0
> >  [<ffffffff8119d9f0>] ? new_sync_read+0xb0/0xb0
> >  [<ffffffff813a1016>] xfs_file_write_iter+0x86/0x130
> >  [<ffffffff8119db05>] do_iter_readv_writev+0x65/0xa0
> 
> and the nfsd got hung up on the inode mutex during a write.
> 
> Which means there's some other process blocked holding the i_mutex.
> sysrq-w and sysrq-t are probably going to tell us more here.
> 
> I suspect we'll have another write stuck in break_layout().....

Yup!  There's a bunch of threads similarly stuck in write, and then:

# cat /proc/17765/stack 
[<ffffffff811edb18>] __break_lease+0x278/0x510
[<ffffffff813cd4d4>] xfs_break_layouts+0x94/0xf0
[<ffffffff813a0903>] xfs_file_aio_write_checks+0x53/0x100
[<ffffffff813a0d7b>] xfs_file_buffered_aio_write.isra.9+0x8b/0x2a0
[<ffffffff813a1016>] xfs_file_write_iter+0x86/0x130
[<ffffffff8119db05>] do_iter_readv_writev+0x65/0xa0
[<ffffffff8119f0b2>] do_readv_writev+0xe2/0x280
[<ffffffff8119f2d9>] vfs_writev+0x39/0x50
[<ffffffffa00ab861>] nfsd_vfs_write.isra.11+0xa1/0x350 [nfsd]
[<ffffffffa00aecee>] nfsd_write+0x8e/0x100 [nfsd]
[<ffffffffa00b97f5>] nfsd4_write+0x185/0x1e0 [nfsd]
[<ffffffffa00bbe37>] nfsd4_proc_compound+0x3c7/0x6f0 [nfsd]
[<ffffffffa00a7463>] nfsd_dispatch+0xc3/0x220 [nfsd]
[<ffffffffa001314f>] svc_process_common+0x43f/0x650 [sunrpc]
[<ffffffffa00134a3>] svc_process+0x143/0x260 [sunrpc]
[<ffffffffa00a6cc7>] nfsd+0x167/0x1e0 [nfsd]
[<ffffffff81093a2f>] kthread+0xef/0x110
[<ffffffff81a9b5ac>] ret_from_fork+0x7c/0xb0
[<ffffffffffffffff>] 0xffffffffffffffff

I'm worried by the blocking break_lease in xfs_break_layouts.  There's a
circular dependency: blocking in break_lease ties up an nfsd thread, but
we'll need an nfsd thread to process the client's layout return.

But in the worst case I'd still expect that to be cleaned up if the
client doesn't return the layout within a lease period (20 seconds on my
server).

In addition to fencing the client, surely we should be forcibly revoking
the layout at that point?
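
Something along these lines, maybe (both helpers are invented names,
just sketching the idea, not existing nfsd functions):

	/* Hypothetical: run once a recall has been outstanding for a
	 * full lease period with no LAYOUTRETURN.
	 */
	static void
	layout_recall_expired(struct nfs4_layout_stateid *ls)
	{
		fence_client(ls);	/* invented: what the fence path tries today */
		revoke_layout(ls);	/* invented: free the layout state and wake
					 * break_lease() waiters so the blocked
					 * nfsd threads can make progress */
	}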

-b.
Christoph Hellwig March 8, 2015, 3:30 p.m. UTC | #6
On Thu, Mar 05, 2015 at 03:47:49PM -0500, J. Bruce Fields wrote:
> nfsd: client 192.168.122.32 failed to respond to layout recall.   Fencing..
> nfsd: fence failed for client 192.168.122.32: -2!
> nfsd: client 192.168.122.32 failed to respond to layout recall.   Fencing..
> nfsd: fence failed for client 192.168.122.32: -2!

There is no userspace helper to do the fencing, so unfortunately this
is expected.

> receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt ffff88005639a000 xid c21abd62

Now this looks like some issue with the low-level callback path.  I've never
seen this before, but from looking at receive_cb_reply this happens if
xprt_lookup_rqst can't find an rpc_rqst structure for the xid.  Looks like
we might be corrupting the request list / xid allocation somewhere?
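
For reference, the matching step is conceptually just an xid lookup over
the list of outstanding requests -- a simplified sketch, not the actual
sunrpc code:

	struct pending_rqst {			/* simplified rpc_rqst */
		__be32			xid;
		struct list_head	list;	/* linux/list.h */
	};

	static struct pending_rqst *
	lookup_pending(struct list_head *recv, __be32 xid)
	{
		struct pending_rqst *req;

		list_for_each_entry(req, recv, list)
			if (req->xid == xid)
				return req;
		return NULL;		/* caller logs "Got unrecognized
					 * reply" and drops it */
	}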

I can prepare a patch for you to aid with xid tracing if you want.
J. Bruce Fields March 9, 2015, 7:45 p.m. UTC | #7
On Sun, Mar 08, 2015 at 04:30:56PM +0100, Christoph Hellwig wrote:
> On Thu, Mar 05, 2015 at 03:47:49PM -0500, J. Bruce Fields wrote:
> > nfsd: client 192.168.122.32 failed to respond to layout recall.   Fencing..
> > nfsd: fence failed for client 192.168.122.32: -2!
> > nfsd: client 192.168.122.32 failed to respond to layout recall.   Fencing..
> > nfsd: fence failed for client 192.168.122.32: -2!
> 
> There is no userspace helper to do the fencing, so unfortunately this
> is expected.
> 
> > receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt ffff88005639a000 xid c21abd62
> 
> Now this looks like some issue with the low-level callback path.  I've never
> seen this before, but from looking at receive_cb_reply this happens if
> xprt_lookup_rqst can't find an rpc_rqst structure for the xid.  Looks like
> we might be corrupting the request list / xid allocation somewhere?
> 
> I can prepare a patch for you to aid with xid tracing if you want.

I'll take a look when I get back.  But before that I'd like to
understand why the layout seems to be left here blocking writes forever,
instead of getting cleaned up after a lease period with no layout
return.

--b.
Christoph Hellwig March 19, 2015, 5:27 p.m. UTC | #8
FYI, I've now managed to reproduce the issue.  I haven't had time
to dig deeper, but it smells a lot like a callback path issue.

Can you send the recursion fix to Linus ASAP and I'll send you a patch
to turn the nopnfs option into a pnfs one, so we're at least doing
fine for 4.0 in any case.  I hope to debug this and may even have
a real fix soon, but so far I don't know how long it will take.
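
Assuming it ends up as a plain export option named "pnfs", opting a
filesystem in would look something like this in /etc/exports (the option
name is an assumption until that patch lands):

	/export/xfs  192.168.122.0/24(rw,sync,pnfs)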
J. Bruce Fields March 19, 2015, 6:47 p.m. UTC | #9
On Thu, Mar 19, 2015 at 06:27:31PM +0100, Christoph Hellwig wrote:
> FYI, I've now managed to reproduce the issue.  I haven't had time
> to dig deeper, but it smells a lot like a callback path issue.
> 
> Can you send the recursion fix to Linus ASAP and I'll send you a patch
> to turn the nopnfs option into a pnfs one, so we're at least doing
> fine for 4.0 in any case.

Sure, sounds good.

Also, there's the problem that when this is turned on a client can end
up doing unnecessary LAYOUTGETs.  Do we have a plan for that?

Possibilities:

	- Just depend on export flags: but some clients may have direct
	  access and some not.  If the clients with direct access are all
	  easily identifiable by IP subnet, maybe it's not a big deal.
	  Still, seems like an administrative hassle.

	- Do nothing, assume the client can deal with this with some
	  kind of heuristics, and/or that the LAYOUTGET calls can be
	  made very cheap.  Not sure if that's true.

	- Use something like GETDEVICELIST so the client can figure out
	  in one go whether any layouts on a given filesystem will work.
	  I forget what the problems with GETDEVICELIST were.

> I hope to debug this and may even have
> a real fix soon, but so far I don't know how long it will take.

OK, thanks.

--b.
Kinglong Mee March 20, 2015, 4:06 a.m. UTC | #10
On Thu, Mar 5, 2015 at 9:17 PM, Christoph Hellwig <hch@lst.de> wrote:
> On Wed, Mar 04, 2015 at 11:08:49PM -0500, J. Bruce Fields wrote:
>> Ah-hah:
>>
>>       static void
>>       nfsd4_cb_layout_fail(struct nfs4_layout_stateid *ls)
>>       {
>>               ...
>>               nfsd4_cb_layout_fail(ls);
>>
>> That'd do it!
>>
>> Haven't tried to figure out why exactly that's getting called, and why
>> only rarely.  Some intermittent problem with the callback path, I guess.
>>
>> Anyway, I think that solves most of the mystery....
>
> Ooops, that was a nasty git merge error in the last rebase, see the fix
> below.  But I really wonder if we need to make the usage of pnfs explicit
> after all, otherwise we'll always hand out layouts on any XFS-exported
> filesystems, which can't be used and will eventually need to be recalled.
>
> ---
> From ad592590cce9f7441c3cd21d030f3a986d8759d7 Mon Sep 17 00:00:00 2001
> From: Christoph Hellwig <hch@lst.de>
> Date: Thu, 5 Mar 2015 06:12:29 -0700
> Subject: nfsd: don't recursively call nfsd4_cb_layout_fail
>
> Due to a merge error when creating c5c707f9 ("nfsd: implement pNFS
> layout recalls"), we recursively call nfsd4_cb_layout_fail from itself,
> leading to stack overflows.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/nfsd/nfs4layouts.c | 2 --
>  1 file changed, 2 deletions(-)
>
> diff --git a/fs/nfsd/nfs4layouts.c b/fs/nfsd/nfs4layouts.c
> index 3c1bfa1..1028a06 100644
> --- a/fs/nfsd/nfs4layouts.c
> +++ b/fs/nfsd/nfs4layouts.c
> @@ -587,8 +587,6 @@ nfsd4_cb_layout_fail(struct nfs4_layout_stateid *ls)
>
>         rpc_ntop((struct sockaddr *)&clp->cl_addr, addr_str, sizeof(addr_str));
>
> -       nfsd4_cb_layout_fail(ls);
> -

Maybe you meant to add "trace_layout_recall_fail(&ls->ls_stid.sc_stateid);" here?
I think the following is better:

@@ -587,7 +587,7 @@ nfsd4_cb_layout_fail(struct nfs4_layout_stateid *ls)

        rpc_ntop((struct sockaddr *)&clp->cl_addr, addr_str, sizeof(addr_str));

-       nfsd4_cb_layout_fail(ls);
+       trace_layout_recall_fail(&ls->ls_stid.sc_stateid);

        printk(KERN_WARNING
                "nfsd: client %s failed to respond to layout recall. "

thanks,
Kinglong Mee
Christoph Hellwig March 20, 2015, 6:49 a.m. UTC | #11
On Thu, Mar 19, 2015 at 02:47:14PM -0400, J. Bruce Fields wrote:
> Also, there's the problem that when this is turned on a client can end
> up doing unnecessary LAYOUTGETs.  Do we have a plan for that?
> 
> Possibilities:
> 
> 	- Just depend on export flags: but some clients may have direct
> 	  access and some not.  If the clients with direct access are all
> 	  easily identifiable by IP subnet, maybe it's not a big deal.
> 	  Still, seems like an administrative hassle.

We definitely want this to avoid getting into problems.

> 
> 	- Do nothing, assume the client can deal with this with some
> 	  kind of heuristics, and/or that the LAYOUTGET calls can be
> 	  made very cheap.  Not sure if that's true.

The calls themselves are cheap, but the client processing of them isn't.  I think we
should just stop issuing LAYOUTGET calls on the client side if we keep
seeing errors again and again.  One option might be to add negative device id
cache entries, similar to how negative dentries work in the dcache.
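
Roughly this shape, say (names and expiry invented, just sketching the
idea, not the actual client code):

	#define NEG_DEVID_TIMEOUT	(120 * HZ)	/* made-up expiry */

	struct neg_devid {			/* invented cache entry */
		struct list_head	list;
		char			id[16];	/* NFS4 device IDs are 16 bytes */
		unsigned long		expires;
	};

	/* Before issuing LAYOUTGET, check whether layouts for this device
	 * ID recently failed, the way negative dentries short-circuit
	 * repeated failed lookups.
	 */
	static bool
	devid_known_bad(struct list_head *cache, const char *id)
	{
		struct neg_devid *neg;

		list_for_each_entry(neg, cache, list)
			if (!memcmp(neg->id, id, sizeof(neg->id)) &&
			    time_before(jiffies, neg->expires))
				return true;	/* skip LAYOUTGET for now */
		return false;
	}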

> 	- Use something like GETDEVICELIST so the client can figure out
> 	  in one go whether any layouts on a given filesystem will work.
> 	  I forget what the problems with GETDEVICELIST were.

The way the device ID rules are written in NFS, it is inherently racy.

If I could go back 10 years in time I'd rewrite device ids to be stateids
bound to a fsid, and a lot of things could be fixed up neatly that way..
Christoph Hellwig March 20, 2015, 6:50 a.m. UTC | #12
On Fri, Mar 20, 2015 at 12:06:18PM +0800, Kinglong Mee wrote:
> Maybe you want adding "trace_layout_recall_fail(&ls->ls_stid.sc_stateid);" here?
> I think the following is better,

It is!  That's indeed what I had in my old trees before the big rebase.

Can you write up a small changelog and add a signoff?

Thanks a lot!

Patch

diff --git a/fs/nfsd/nfs4layouts.c b/fs/nfsd/nfs4layouts.c
index 3c1bfa1..1028a06 100644
--- a/fs/nfsd/nfs4layouts.c
+++ b/fs/nfsd/nfs4layouts.c
@@ -587,8 +587,6 @@  nfsd4_cb_layout_fail(struct nfs4_layout_stateid *ls)
 
 	rpc_ntop((struct sockaddr *)&clp->cl_addr, addr_str, sizeof(addr_str));
 
-	nfsd4_cb_layout_fail(ls);
-
 	printk(KERN_WARNING
 		"nfsd: client %s failed to respond to layout recall. "
 		"  Fencing..\n", addr_str);