btrfs: nofs inode allocations

Message ID	20190909141204.24557-1-josef@toxicpanda.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <SRS0=qsaG=XE=vger.kernel.org=linux-btrfs-owner@kernel.org> From: Josef Bacik <josef@toxicpanda.com> To: kernel-team@fb.com, linux-btrfs@vger.kernel.org Cc: Zdenek Sojka <zsojka@seznam.cz> Subject: [PATCH] btrfs: nofs inode allocations Date: Mon, 9 Sep 2019 10:12:04 -0400 Message-Id: <20190909141204.24557-1-josef@toxicpanda.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Sender: linux-btrfs-owner@vger.kernel.org Precedence: bulk
Series	btrfs: nofs inode allocations \| expand btrfs: nofs inode allocations

Message ID

20190909141204.24557-1-josef@toxicpanda.com (mailing list archive)

State

New, archived

Headers

From: Josef Bacik <josef@toxicpanda.com>
To: kernel-team@fb.com, linux-btrfs@vger.kernel.org
Cc: Zdenek Sojka <zsojka@seznam.cz>
Subject: [PATCH] btrfs: nofs inode allocations
Date: Mon,  9 Sep 2019 10:12:04 -0400
Message-Id: <20190909141204.24557-1-josef@toxicpanda.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Sender: linux-btrfs-owner@vger.kernel.org
Precedence: bulk

Series

btrfs: nofs inode allocations | expand

Commit Message

Josef Bacik Sept. 9, 2019, 2:12 p.m. UTC

A user reported a lockdep splat

 ======================================================
 WARNING: possible circular locking dependency detected
 5.2.11-gentoo #2 Not tainted
 ------------------------------------------------------
 kswapd0/711 is trying to acquire lock:
 000000007777a663 (sb_internal){.+.+}, at: start_transaction+0x3a8/0x500

but task is already holding lock:
 000000000ba86300 (fs_reclaim){+.+.}, at: __fs_reclaim_acquire+0x0/0x30

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #1 (fs_reclaim){+.+.}:
 kmem_cache_alloc+0x1f/0x1c0
 btrfs_alloc_inode+0x1f/0x260
 alloc_inode+0x16/0xa0
 new_inode+0xe/0xb0
 btrfs_new_inode+0x70/0x610
 btrfs_symlink+0xd0/0x420
 vfs_symlink+0x9c/0x100
 do_symlinkat+0x66/0xe0
 do_syscall_64+0x55/0x1c0
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

-> #0 (sb_internal){.+.+}:
 __sb_start_write+0xf6/0x150
 start_transaction+0x3a8/0x500
 btrfs_commit_inode_delayed_inode+0x59/0x110
 btrfs_evict_inode+0x19e/0x4c0
 evict+0xbc/0x1f0
 inode_lru_isolate+0x113/0x190
 __list_lru_walk_one.isra.4+0x5c/0x100
 list_lru_walk_one+0x32/0x50
 prune_icache_sb+0x36/0x80
 super_cache_scan+0x14a/0x1d0
 do_shrink_slab+0x131/0x320
 shrink_node+0xf7/0x380
 balance_pgdat+0x2d5/0x640
 kswapd+0x2ba/0x5e0
 kthread+0x147/0x160
 ret_from_fork+0x24/0x30

other info that might help us debug this:

 Possible unsafe locking scenario:

 CPU0 CPU1
 ---- ----
 lock(fs_reclaim);
 lock(sb_internal);
 lock(fs_reclaim);
 lock(sb_internal);
*** DEADLOCK ***

 3 locks held by kswapd0/711:
 #0: 000000000ba86300 (fs_reclaim){+.+.}, at: __fs_reclaim_acquire+0x0/0x30
 #1: 000000004a5100f8 (shrinker_rwsem){++++}, at: shrink_node+0x9a/0x380
 #2: 00000000f956fa46 (&type->s_umount_key#30){++++}, at: super_cache_scan+0x35/0x1d0

stack backtrace:
 CPU: 7 PID: 711 Comm: kswapd0 Not tainted 5.2.11-gentoo #2
 Hardware name: Dell Inc. Precision Tower 3620/0MWYPT, BIOS 2.4.2 09/29/2017
 Call Trace:
 dump_stack+0x85/0xc7
 print_circular_bug.cold.40+0x1d9/0x235
 __lock_acquire+0x18b1/0x1f00
 lock_acquire+0xa6/0x170
 ? start_transaction+0x3a8/0x500
 __sb_start_write+0xf6/0x150
 ? start_transaction+0x3a8/0x500
 start_transaction+0x3a8/0x500
 btrfs_commit_inode_delayed_inode+0x59/0x110
 btrfs_evict_inode+0x19e/0x4c0
 ? var_wake_function+0x20/0x20
 evict+0xbc/0x1f0
 inode_lru_isolate+0x113/0x190
 ? discard_new_inode+0xc0/0xc0
 __list_lru_walk_one.isra.4+0x5c/0x100
 ? discard_new_inode+0xc0/0xc0
 list_lru_walk_one+0x32/0x50
 prune_icache_sb+0x36/0x80
 super_cache_scan+0x14a/0x1d0
 do_shrink_slab+0x131/0x320
 shrink_node+0xf7/0x380
 balance_pgdat+0x2d5/0x640
 kswapd+0x2ba/0x5e0
 ? __wake_up_common_lock+0x90/0x90
 kthread+0x147/0x160
 ? balance_pgdat+0x640/0x640
 ? __kthread_create_on_node+0x160/0x160
 ret_from_fork+0x24/0x30

This is because btrfs_new_inode() calls new_inode() under the
transaction.  We could probably move the new_inode() outside of this but
for now just wrap it in memalloc_nofs_save().

Reported-by: Zdenek Sojka <zsojka@seznam.cz>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/inode.c | 3 +++
 1 file changed, 3 insertions(+)

Comments

Nikolay Borisov Sept. 9, 2019, 2:28 p.m. UTC | #1

On 9.09.19 г. 17:12 ч., Josef Bacik wrote:
> A user reported a lockdep splat
> 
>  ======================================================
>  WARNING: possible circular locking dependency detected
>  5.2.11-gentoo #2 Not tainted
>  ------------------------------------------------------
>  kswapd0/711 is trying to acquire lock:
>  000000007777a663 (sb_internal){.+.+}, at: start_transaction+0x3a8/0x500
> 
> but task is already holding lock:
>  000000000ba86300 (fs_reclaim){+.+.}, at: __fs_reclaim_acquire+0x0/0x30
> 
> which lock already depends on the new lock.
> 
> the existing dependency chain (in reverse order) is:
> 
> -> #1 (fs_reclaim){+.+.}:
>  kmem_cache_alloc+0x1f/0x1c0
>  btrfs_alloc_inode+0x1f/0x260
>  alloc_inode+0x16/0xa0
>  new_inode+0xe/0xb0
>  btrfs_new_inode+0x70/0x610
>  btrfs_symlink+0xd0/0x420
>  vfs_symlink+0x9c/0x100
>  do_symlinkat+0x66/0xe0
>  do_syscall_64+0x55/0x1c0
>  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> 
> -> #0 (sb_internal){.+.+}:
>  __sb_start_write+0xf6/0x150
>  start_transaction+0x3a8/0x500
>  btrfs_commit_inode_delayed_inode+0x59/0x110
>  btrfs_evict_inode+0x19e/0x4c0
>  evict+0xbc/0x1f0
>  inode_lru_isolate+0x113/0x190
>  __list_lru_walk_one.isra.4+0x5c/0x100
>  list_lru_walk_one+0x32/0x50
>  prune_icache_sb+0x36/0x80
>  super_cache_scan+0x14a/0x1d0
>  do_shrink_slab+0x131/0x320
>  shrink_node+0xf7/0x380
>  balance_pgdat+0x2d5/0x640
>  kswapd+0x2ba/0x5e0
>  kthread+0x147/0x160
>  ret_from_fork+0x24/0x30
> 
> other info that might help us debug this:
> 
>  Possible unsafe locking scenario:
> 
>  CPU0 CPU1
>  ---- ----
>  lock(fs_reclaim);
>  lock(sb_internal);
>  lock(fs_reclaim);
>  lock(sb_internal);
> *** DEADLOCK ***
> 
>  3 locks held by kswapd0/711:
>  #0: 000000000ba86300 (fs_reclaim){+.+.}, at: __fs_reclaim_acquire+0x0/0x30
>  #1: 000000004a5100f8 (shrinker_rwsem){++++}, at: shrink_node+0x9a/0x380
>  #2: 00000000f956fa46 (&type->s_umount_key#30){++++}, at: super_cache_scan+0x35/0x1d0
> 
> stack backtrace:
>  CPU: 7 PID: 711 Comm: kswapd0 Not tainted 5.2.11-gentoo #2
>  Hardware name: Dell Inc. Precision Tower 3620/0MWYPT, BIOS 2.4.2 09/29/2017
>  Call Trace:
>  dump_stack+0x85/0xc7
>  print_circular_bug.cold.40+0x1d9/0x235
>  __lock_acquire+0x18b1/0x1f00
>  lock_acquire+0xa6/0x170
>  ? start_transaction+0x3a8/0x500
>  __sb_start_write+0xf6/0x150
>  ? start_transaction+0x3a8/0x500
>  start_transaction+0x3a8/0x500
>  btrfs_commit_inode_delayed_inode+0x59/0x110
>  btrfs_evict_inode+0x19e/0x4c0
>  ? var_wake_function+0x20/0x20
>  evict+0xbc/0x1f0
>  inode_lru_isolate+0x113/0x190
>  ? discard_new_inode+0xc0/0xc0
>  __list_lru_walk_one.isra.4+0x5c/0x100
>  ? discard_new_inode+0xc0/0xc0
>  list_lru_walk_one+0x32/0x50
>  prune_icache_sb+0x36/0x80
>  super_cache_scan+0x14a/0x1d0
>  do_shrink_slab+0x131/0x320
>  shrink_node+0xf7/0x380
>  balance_pgdat+0x2d5/0x640
>  kswapd+0x2ba/0x5e0
>  ? __wake_up_common_lock+0x90/0x90
>  kthread+0x147/0x160
>  ? balance_pgdat+0x640/0x640
>  ? __kthread_create_on_node+0x160/0x160
>  ret_from_fork+0x24/0x30
> 
> This is because btrfs_new_inode() calls new_inode() under the
> transaction.  We could probably move the new_inode() outside of this but
> for now just wrap it in memalloc_nofs_save().

If I'm understanding correctly what happens here is that symlinking
wants to instantiate a new inode
(new_inode->btrfs_alloc_inode->kmem_cache_alloc with GFP_KERNEL) but it
triggers background reclaim, hence goes to sleep, while holding a
transaction. At the same time background reclaim joins the same
transaction which is now blocked by the symlinking thread, hence it's
prone to deadlock ?

Filipe Manana Sept. 11, 2019, 9:37 a.m. UTC | #2

On Tue, Sep 10, 2019 at 7:22 PM Josef Bacik <josef@toxicpanda.com> wrote:
>
> A user reported a lockdep splat
>
>  ======================================================
>  WARNING: possible circular locking dependency detected
>  5.2.11-gentoo #2 Not tainted
>  ------------------------------------------------------
>  kswapd0/711 is trying to acquire lock:
>  000000007777a663 (sb_internal){.+.+}, at: start_transaction+0x3a8/0x500
>
> but task is already holding lock:
>  000000000ba86300 (fs_reclaim){+.+.}, at: __fs_reclaim_acquire+0x0/0x30
>
> which lock already depends on the new lock.
>
> the existing dependency chain (in reverse order) is:
>
> -> #1 (fs_reclaim){+.+.}:
>  kmem_cache_alloc+0x1f/0x1c0
>  btrfs_alloc_inode+0x1f/0x260
>  alloc_inode+0x16/0xa0
>  new_inode+0xe/0xb0
>  btrfs_new_inode+0x70/0x610
>  btrfs_symlink+0xd0/0x420
>  vfs_symlink+0x9c/0x100
>  do_symlinkat+0x66/0xe0
>  do_syscall_64+0x55/0x1c0
>  entry_SYSCALL_64_after_hwframe+0x49/0xbe
>
> -> #0 (sb_internal){.+.+}:
>  __sb_start_write+0xf6/0x150
>  start_transaction+0x3a8/0x500
>  btrfs_commit_inode_delayed_inode+0x59/0x110
>  btrfs_evict_inode+0x19e/0x4c0
>  evict+0xbc/0x1f0
>  inode_lru_isolate+0x113/0x190
>  __list_lru_walk_one.isra.4+0x5c/0x100
>  list_lru_walk_one+0x32/0x50
>  prune_icache_sb+0x36/0x80
>  super_cache_scan+0x14a/0x1d0
>  do_shrink_slab+0x131/0x320
>  shrink_node+0xf7/0x380
>  balance_pgdat+0x2d5/0x640
>  kswapd+0x2ba/0x5e0
>  kthread+0x147/0x160
>  ret_from_fork+0x24/0x30
>
> other info that might help us debug this:
>
>  Possible unsafe locking scenario:
>
>  CPU0 CPU1
>  ---- ----
>  lock(fs_reclaim);
>  lock(sb_internal);
>  lock(fs_reclaim);
>  lock(sb_internal);
> *** DEADLOCK ***
>
>  3 locks held by kswapd0/711:
>  #0: 000000000ba86300 (fs_reclaim){+.+.}, at: __fs_reclaim_acquire+0x0/0x30
>  #1: 000000004a5100f8 (shrinker_rwsem){++++}, at: shrink_node+0x9a/0x380
>  #2: 00000000f956fa46 (&type->s_umount_key#30){++++}, at: super_cache_scan+0x35/0x1d0
>
> stack backtrace:
>  CPU: 7 PID: 711 Comm: kswapd0 Not tainted 5.2.11-gentoo #2
>  Hardware name: Dell Inc. Precision Tower 3620/0MWYPT, BIOS 2.4.2 09/29/2017
>  Call Trace:
>  dump_stack+0x85/0xc7
>  print_circular_bug.cold.40+0x1d9/0x235
>  __lock_acquire+0x18b1/0x1f00
>  lock_acquire+0xa6/0x170
>  ? start_transaction+0x3a8/0x500
>  __sb_start_write+0xf6/0x150
>  ? start_transaction+0x3a8/0x500
>  start_transaction+0x3a8/0x500
>  btrfs_commit_inode_delayed_inode+0x59/0x110
>  btrfs_evict_inode+0x19e/0x4c0
>  ? var_wake_function+0x20/0x20
>  evict+0xbc/0x1f0
>  inode_lru_isolate+0x113/0x190
>  ? discard_new_inode+0xc0/0xc0
>  __list_lru_walk_one.isra.4+0x5c/0x100
>  ? discard_new_inode+0xc0/0xc0
>  list_lru_walk_one+0x32/0x50
>  prune_icache_sb+0x36/0x80
>  super_cache_scan+0x14a/0x1d0
>  do_shrink_slab+0x131/0x320
>  shrink_node+0xf7/0x380
>  balance_pgdat+0x2d5/0x640
>  kswapd+0x2ba/0x5e0
>  ? __wake_up_common_lock+0x90/0x90
>  kthread+0x147/0x160
>  ? balance_pgdat+0x640/0x640
>  ? __kthread_create_on_node+0x160/0x160
>  ret_from_fork+0x24/0x30
>
> This is because btrfs_new_inode() calls new_inode() under the
> transaction.  We could probably move the new_inode() outside of this but
> for now just wrap it in memalloc_nofs_save().
>
> Reported-by: Zdenek Sojka <zsojka@seznam.cz>
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>

To aid the stable team figure out to where to backport things, adding
a Fixes tag (Fixes: 712e36c5f2a7fa ("btrfs: use GFP_KERNEL in
btrfs_alloc_inode")) or a " CC: stable@vger.kernel.org #  4.16+" would
help.

> ---
>  fs/btrfs/inode.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index aece5dd0e7a8..bf40d1085e4e 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -6291,13 +6291,16 @@ static struct inode *btrfs_new_inode(struct btrfs_trans_handle *trans,
>         u32 sizes[2];
>         int nitems = name ? 2 : 1;
>         unsigned long ptr;
> +       unsigned long nofs_flag;

Should be unsigned int (it's what memalloc_nofs_save() returns exactly).

Anyway, the change itself looks good to me.

Reviewed-by: Filipe Manana <fdmanana@suse.com>

>         int ret;
>
>         path = btrfs_alloc_path();
>         if (!path)
>                 return ERR_PTR(-ENOMEM);
>
> +       nofs_flag = memalloc_nofs_save();
>         inode = new_inode(fs_info->sb);
> +       memalloc_nofs_restore(nofs_flag);
>         if (!inode) {
>                 btrfs_free_path(path);
>                 return ERR_PTR(-ENOMEM);
> --
> 2.21.0
>

David Sterba Oct. 1, 2019, 6:06 p.m. UTC | #3

On Mon, Sep 09, 2019 at 05:28:11PM +0300, Nikolay Borisov wrote:
> > This is because btrfs_new_inode() calls new_inode() under the
> > transaction.  We could probably move the new_inode() outside of this but
> > for now just wrap it in memalloc_nofs_save().
> 
> If I'm understanding correctly what happens here is that symlinking
> wants to instantiate a new inode
> (new_inode->btrfs_alloc_inode->kmem_cache_alloc with GFP_KERNEL) but it
> triggers background reclaim, hence goes to sleep, while holding a
> transaction. At the same time background reclaim joins the same
> transaction which is now blocked by the symlinking thread, hence it's
> prone to deadlock ?

Yes, that's the pattern, GFP_KERNEL inside
start_transaction/commit_transaction. The more generic fix is in the works.

David Sterba Oct. 1, 2019, 6:11 p.m. UTC | #4

On Mon, Sep 09, 2019 at 10:12:04AM -0400, Josef Bacik wrote:
> A user reported a lockdep splat
> 
>  ======================================================
>  WARNING: possible circular locking dependency detected
>  5.2.11-gentoo #2 Not tainted
>  ------------------------------------------------------
>  kswapd0/711 is trying to acquire lock:
>  000000007777a663 (sb_internal){.+.+}, at: start_transaction+0x3a8/0x500
> 
> but task is already holding lock:
>  000000000ba86300 (fs_reclaim){+.+.}, at: __fs_reclaim_acquire+0x0/0x30
> 
> which lock already depends on the new lock.
> 
> the existing dependency chain (in reverse order) is:
> 
> -> #1 (fs_reclaim){+.+.}:
>  kmem_cache_alloc+0x1f/0x1c0
>  btrfs_alloc_inode+0x1f/0x260
>  alloc_inode+0x16/0xa0
>  new_inode+0xe/0xb0
>  btrfs_new_inode+0x70/0x610
>  btrfs_symlink+0xd0/0x420
>  vfs_symlink+0x9c/0x100
>  do_symlinkat+0x66/0xe0
>  do_syscall_64+0x55/0x1c0
>  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> 
> -> #0 (sb_internal){.+.+}:
>  __sb_start_write+0xf6/0x150
>  start_transaction+0x3a8/0x500
>  btrfs_commit_inode_delayed_inode+0x59/0x110
>  btrfs_evict_inode+0x19e/0x4c0
>  evict+0xbc/0x1f0
>  inode_lru_isolate+0x113/0x190
>  __list_lru_walk_one.isra.4+0x5c/0x100
>  list_lru_walk_one+0x32/0x50
>  prune_icache_sb+0x36/0x80
>  super_cache_scan+0x14a/0x1d0
>  do_shrink_slab+0x131/0x320
>  shrink_node+0xf7/0x380
>  balance_pgdat+0x2d5/0x640
>  kswapd+0x2ba/0x5e0
>  kthread+0x147/0x160
>  ret_from_fork+0x24/0x30
> 
> other info that might help us debug this:
> 
>  Possible unsafe locking scenario:
> 
>  CPU0 CPU1
>  ---- ----
>  lock(fs_reclaim);
>  lock(sb_internal);
>  lock(fs_reclaim);
>  lock(sb_internal);
> *** DEADLOCK ***
> 
>  3 locks held by kswapd0/711:
>  #0: 000000000ba86300 (fs_reclaim){+.+.}, at: __fs_reclaim_acquire+0x0/0x30
>  #1: 000000004a5100f8 (shrinker_rwsem){++++}, at: shrink_node+0x9a/0x380
>  #2: 00000000f956fa46 (&type->s_umount_key#30){++++}, at: super_cache_scan+0x35/0x1d0
> 
> stack backtrace:
>  CPU: 7 PID: 711 Comm: kswapd0 Not tainted 5.2.11-gentoo #2
>  Hardware name: Dell Inc. Precision Tower 3620/0MWYPT, BIOS 2.4.2 09/29/2017
>  Call Trace:
>  dump_stack+0x85/0xc7
>  print_circular_bug.cold.40+0x1d9/0x235
>  __lock_acquire+0x18b1/0x1f00
>  lock_acquire+0xa6/0x170
>  ? start_transaction+0x3a8/0x500
>  __sb_start_write+0xf6/0x150
>  ? start_transaction+0x3a8/0x500
>  start_transaction+0x3a8/0x500
>  btrfs_commit_inode_delayed_inode+0x59/0x110
>  btrfs_evict_inode+0x19e/0x4c0
>  ? var_wake_function+0x20/0x20
>  evict+0xbc/0x1f0
>  inode_lru_isolate+0x113/0x190
>  ? discard_new_inode+0xc0/0xc0
>  __list_lru_walk_one.isra.4+0x5c/0x100
>  ? discard_new_inode+0xc0/0xc0
>  list_lru_walk_one+0x32/0x50
>  prune_icache_sb+0x36/0x80
>  super_cache_scan+0x14a/0x1d0
>  do_shrink_slab+0x131/0x320
>  shrink_node+0xf7/0x380
>  balance_pgdat+0x2d5/0x640
>  kswapd+0x2ba/0x5e0
>  ? __wake_up_common_lock+0x90/0x90
>  kthread+0x147/0x160
>  ? balance_pgdat+0x640/0x640
>  ? __kthread_create_on_node+0x160/0x160
>  ret_from_fork+0x24/0x30
> 
> This is because btrfs_new_inode() calls new_inode() under the
> transaction.  We could probably move the new_inode() outside of this but
> for now just wrap it in memalloc_nofs_save().
> 
> Reported-by: Zdenek Sojka <zsojka@seznam.cz>
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>

Added to 5.4 queue, with the type fixed and updated subject.

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index aece5dd0e7a8..bf40d1085e4e 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6291,13 +6291,16 @@  static struct inode *btrfs_new_inode(struct btrfs_trans_handle *trans,
 	u32 sizes[2];
 	int nitems = name ? 2 : 1;
 	unsigned long ptr;
+	unsigned long nofs_flag;
 	int ret;
 
 	path = btrfs_alloc_path();
 	if (!path)
 		return ERR_PTR(-ENOMEM);
 
+	nofs_flag = memalloc_nofs_save();
 	inode = new_inode(fs_info->sb);
+	memalloc_nofs_restore(nofs_flag);
 	if (!inode) {
 		btrfs_free_path(path);
 		return ERR_PTR(-ENOMEM);

btrfs: nofs inode allocations

Commit Message

Comments

Patch