Btrfs: fix deadlock when writing out free space caches
diff mbox series

Message ID 20181012090355.25755-1-fdmanana@kernel.org
State New
Headers show
Series
  • Btrfs: fix deadlock when writing out free space caches
Related show

Commit Message

Filipe Manana Oct. 12, 2018, 9:03 a.m. UTC
From: Filipe Manana <fdmanana@suse.com>

When writing out a block group free space cache we can end deadlocking
with ourseves on an extent buffer lock resulting in a warning like the
following:

  [245043.379979] WARNING: CPU: 4 PID: 2608 at fs/btrfs/locking.c:251 btrfs_tree_lock+0x1be/0x1d0 [btrfs]
  [245043.392792] CPU: 4 PID: 2608 Comm: btrfs-transacti Tainted: G
    W I      4.16.8 #1
  [245043.395489] RIP: 0010:btrfs_tree_lock+0x1be/0x1d0 [btrfs]
  [245043.396791] RSP: 0018:ffffc9000424b840 EFLAGS: 00010246
  [245043.398093] RAX: 0000000000000a30 RBX: ffff8807e20a3d20 RCX: 0000000000000001
  [245043.399414] RDX: 0000000000000001 RSI: 0000000000000002 RDI: ffff8807e20a3d20
  [245043.400732] RBP: 0000000000000001 R08: ffff88041f39a700 R09: ffff880000000000
  [245043.402021] R10: 0000000000000040 R11: ffff8807e20a3d20 R12: ffff8807cb220630
  [245043.403296] R13: 0000000000000001 R14: ffff8807cb220628 R15: ffff88041fbdf000
  [245043.404780] FS:  0000000000000000(0000) GS:ffff88082fc80000(0000) knlGS:0000000000000000
  [245043.406050] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [245043.407321] CR2: 00007fffdbdb9f10 CR3: 0000000001c09005 CR4: 00000000000206e0
  [245043.408670] Call Trace:
  [245043.409977]  btrfs_search_slot+0x761/0xa60 [btrfs]
  [245043.411278]  btrfs_insert_empty_items+0x62/0xb0 [btrfs]
  [245043.412572]  btrfs_insert_item+0x5b/0xc0 [btrfs]
  [245043.413922]  btrfs_create_pending_block_groups+0xfb/0x1e0 [btrfs]
  [245043.415216]  do_chunk_alloc+0x1e5/0x2a0 [btrfs]
  [245043.416487]  find_free_extent+0xcd0/0xf60 [btrfs]
  [245043.417813]  btrfs_reserve_extent+0x96/0x1e0 [btrfs]
  [245043.419105]  btrfs_alloc_tree_block+0xfb/0x4a0 [btrfs]
  [245043.420378]  __btrfs_cow_block+0x127/0x550 [btrfs]
  [245043.421652]  btrfs_cow_block+0xee/0x190 [btrfs]
  [245043.422979]  btrfs_search_slot+0x227/0xa60 [btrfs]
  [245043.424279]  ? btrfs_update_inode_item+0x59/0x100 [btrfs]
  [245043.425538]  ? iput+0x72/0x1e0
  [245043.426798]  write_one_cache_group.isra.49+0x20/0x90 [btrfs]
  [245043.428131]  btrfs_start_dirty_block_groups+0x102/0x420 [btrfs]
  [245043.429419]  btrfs_commit_transaction+0x11b/0x880 [btrfs]
  [245043.430712]  ? start_transaction+0x8e/0x410 [btrfs]
  [245043.432006]  transaction_kthread+0x184/0x1a0 [btrfs]
  [245043.433341]  kthread+0xf0/0x130
  [245043.434628]  ? btrfs_cleanup_transaction+0x4e0/0x4e0 [btrfs]
  [245043.435928]  ? kthread_create_worker_on_cpu+0x40/0x40
  [245043.437236]  ret_from_fork+0x1f/0x30
  [245043.441054] ---[ end trace 15abaa2aaf36827f ]---

This is because at write_one_cache_group() when we are COWing a leaf from
the extent tree we end up allocating a new block group (chunk) and,
because we have hit a threshold on the number of bytes reserved for system
chunks, we attempt to finalize the creation of new block groups from the
current transaction, by calling btrfs_create_pending_block_groups().
However here we also need to modify the extent tree in order to insert
a block group item, and if the location for this new block group item
happens to be in the same leaf that we were COWing earlier, we deadlock
since btrfs_search_slot() tries to write lock the extent buffer that we
locked before at write_one_cache_group().

We have already hit similar cases in the past and commit d9a0540a79f8
("Btrfs: fix deadlock when finalizing block group creation") fixed some
of those cases by delaying the creation of pending block groups at the
known specific spots that could lead to a deadlock. This change reworks
that commit to be more generic so that we don't have to add similar logic
to every possible path that can lead to a deadlock. This is done by
making __btrfs_cow_block() disallowing the creation of new block groups
(setting the transaction's can_flush_pending_bgs to false) before it
attempts to allocate a new extent buffer for either the extent, chunk or
device trees, since those are the trees that pending block creation
modifies. Once the new extent buffer is allocated, it allows creation of
pending block groups to happen again.

This change depends on a recent patch from Josef which is not yet in
Linus' tree, named "btrfs: make sure we create all new block groups" in
order to avoid occasional warnings at btrfs_trans_release_chunk_metadata().

Fixes: d9a0540a79f8 ("Btrfs: fix deadlock when finalizing block group creation")
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=199753
Link: https://lore.kernel.org/linux-btrfs/CAJtFHUTHna09ST-_EEiyWmDH6gAqS6wa=zMNMBsifj8ABu99cw@mail.gmail.com/
Reported-by: E V <eliventer@gmail.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/ctree.c       | 17 +++++++++++++++++
 fs/btrfs/extent-tree.c | 16 ++++++----------
 2 files changed, 23 insertions(+), 10 deletions(-)

Comments

Josef Bacik Oct. 12, 2018, 1:40 p.m. UTC | #1
On Fri, Oct 12, 2018 at 10:03:55AM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> When writing out a block group free space cache we can end deadlocking
> with ourseves on an extent buffer lock resulting in a warning like the
> following:
> 
>   [245043.379979] WARNING: CPU: 4 PID: 2608 at fs/btrfs/locking.c:251 btrfs_tree_lock+0x1be/0x1d0 [btrfs]
>   [245043.392792] CPU: 4 PID: 2608 Comm: btrfs-transacti Tainted: G
>     W I      4.16.8 #1
>   [245043.395489] RIP: 0010:btrfs_tree_lock+0x1be/0x1d0 [btrfs]
>   [245043.396791] RSP: 0018:ffffc9000424b840 EFLAGS: 00010246
>   [245043.398093] RAX: 0000000000000a30 RBX: ffff8807e20a3d20 RCX: 0000000000000001
>   [245043.399414] RDX: 0000000000000001 RSI: 0000000000000002 RDI: ffff8807e20a3d20
>   [245043.400732] RBP: 0000000000000001 R08: ffff88041f39a700 R09: ffff880000000000
>   [245043.402021] R10: 0000000000000040 R11: ffff8807e20a3d20 R12: ffff8807cb220630
>   [245043.403296] R13: 0000000000000001 R14: ffff8807cb220628 R15: ffff88041fbdf000
>   [245043.404780] FS:  0000000000000000(0000) GS:ffff88082fc80000(0000) knlGS:0000000000000000
>   [245043.406050] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>   [245043.407321] CR2: 00007fffdbdb9f10 CR3: 0000000001c09005 CR4: 00000000000206e0
>   [245043.408670] Call Trace:
>   [245043.409977]  btrfs_search_slot+0x761/0xa60 [btrfs]
>   [245043.411278]  btrfs_insert_empty_items+0x62/0xb0 [btrfs]
>   [245043.412572]  btrfs_insert_item+0x5b/0xc0 [btrfs]
>   [245043.413922]  btrfs_create_pending_block_groups+0xfb/0x1e0 [btrfs]
>   [245043.415216]  do_chunk_alloc+0x1e5/0x2a0 [btrfs]
>   [245043.416487]  find_free_extent+0xcd0/0xf60 [btrfs]
>   [245043.417813]  btrfs_reserve_extent+0x96/0x1e0 [btrfs]
>   [245043.419105]  btrfs_alloc_tree_block+0xfb/0x4a0 [btrfs]
>   [245043.420378]  __btrfs_cow_block+0x127/0x550 [btrfs]
>   [245043.421652]  btrfs_cow_block+0xee/0x190 [btrfs]
>   [245043.422979]  btrfs_search_slot+0x227/0xa60 [btrfs]
>   [245043.424279]  ? btrfs_update_inode_item+0x59/0x100 [btrfs]
>   [245043.425538]  ? iput+0x72/0x1e0
>   [245043.426798]  write_one_cache_group.isra.49+0x20/0x90 [btrfs]
>   [245043.428131]  btrfs_start_dirty_block_groups+0x102/0x420 [btrfs]
>   [245043.429419]  btrfs_commit_transaction+0x11b/0x880 [btrfs]
>   [245043.430712]  ? start_transaction+0x8e/0x410 [btrfs]
>   [245043.432006]  transaction_kthread+0x184/0x1a0 [btrfs]
>   [245043.433341]  kthread+0xf0/0x130
>   [245043.434628]  ? btrfs_cleanup_transaction+0x4e0/0x4e0 [btrfs]
>   [245043.435928]  ? kthread_create_worker_on_cpu+0x40/0x40
>   [245043.437236]  ret_from_fork+0x1f/0x30
>   [245043.441054] ---[ end trace 15abaa2aaf36827f ]---
> 
> This is because at write_one_cache_group() when we are COWing a leaf from
> the extent tree we end up allocating a new block group (chunk) and,
> because we have hit a threshold on the number of bytes reserved for system
> chunks, we attempt to finalize the creation of new block groups from the
> current transaction, by calling btrfs_create_pending_block_groups().
> However here we also need to modify the extent tree in order to insert
> a block group item, and if the location for this new block group item
> happens to be in the same leaf that we were COWing earlier, we deadlock
> since btrfs_search_slot() tries to write lock the extent buffer that we
> locked before at write_one_cache_group().
> 
> We have already hit similar cases in the past and commit d9a0540a79f8
> ("Btrfs: fix deadlock when finalizing block group creation") fixed some
> of those cases by delaying the creation of pending block groups at the
> known specific spots that could lead to a deadlock. This change reworks
> that commit to be more generic so that we don't have to add similar logic
> to every possible path that can lead to a deadlock. This is done by
> making __btrfs_cow_block() disallowing the creation of new block groups
> (setting the transaction's can_flush_pending_bgs to false) before it
> attempts to allocate a new extent buffer for either the extent, chunk or
> device trees, since those are the trees that pending block creation
> modifies. Once the new extent buffer is allocated, it allows creation of
> pending block groups to happen again.
> 
> This change depends on a recent patch from Josef which is not yet in
> Linus' tree, named "btrfs: make sure we create all new block groups" in
> order to avoid occasional warnings at btrfs_trans_release_chunk_metadata().
> 
> Fixes: d9a0540a79f8 ("Btrfs: fix deadlock when finalizing block group creation")
> Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=199753
> Link: https://lore.kernel.org/linux-btrfs/CAJtFHUTHna09ST-_EEiyWmDH6gAqS6wa=zMNMBsifj8ABu99cw@mail.gmail.com/
> Reported-by: E V <eliventer@gmail.com>
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef
David Sterba Oct. 17, 2018, 2:26 p.m. UTC | #2
On Fri, Oct 12, 2018 at 10:03:55AM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> When writing out a block group free space cache we can end deadlocking
> with ourseves on an extent buffer lock resulting in a warning like the
> following:
> 
>   [245043.379979] WARNING: CPU: 4 PID: 2608 at fs/btrfs/locking.c:251 btrfs_tree_lock+0x1be/0x1d0 [btrfs]
>   [245043.392792] CPU: 4 PID: 2608 Comm: btrfs-transacti Tainted: G
>     W I      4.16.8 #1
>   [245043.395489] RIP: 0010:btrfs_tree_lock+0x1be/0x1d0 [btrfs]
>   [245043.396791] RSP: 0018:ffffc9000424b840 EFLAGS: 00010246
>   [245043.398093] RAX: 0000000000000a30 RBX: ffff8807e20a3d20 RCX: 0000000000000001
>   [245043.399414] RDX: 0000000000000001 RSI: 0000000000000002 RDI: ffff8807e20a3d20
>   [245043.400732] RBP: 0000000000000001 R08: ffff88041f39a700 R09: ffff880000000000
>   [245043.402021] R10: 0000000000000040 R11: ffff8807e20a3d20 R12: ffff8807cb220630
>   [245043.403296] R13: 0000000000000001 R14: ffff8807cb220628 R15: ffff88041fbdf000
>   [245043.404780] FS:  0000000000000000(0000) GS:ffff88082fc80000(0000) knlGS:0000000000000000
>   [245043.406050] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>   [245043.407321] CR2: 00007fffdbdb9f10 CR3: 0000000001c09005 CR4: 00000000000206e0
>   [245043.408670] Call Trace:
>   [245043.409977]  btrfs_search_slot+0x761/0xa60 [btrfs]
>   [245043.411278]  btrfs_insert_empty_items+0x62/0xb0 [btrfs]
>   [245043.412572]  btrfs_insert_item+0x5b/0xc0 [btrfs]
>   [245043.413922]  btrfs_create_pending_block_groups+0xfb/0x1e0 [btrfs]
>   [245043.415216]  do_chunk_alloc+0x1e5/0x2a0 [btrfs]
>   [245043.416487]  find_free_extent+0xcd0/0xf60 [btrfs]
>   [245043.417813]  btrfs_reserve_extent+0x96/0x1e0 [btrfs]
>   [245043.419105]  btrfs_alloc_tree_block+0xfb/0x4a0 [btrfs]
>   [245043.420378]  __btrfs_cow_block+0x127/0x550 [btrfs]
>   [245043.421652]  btrfs_cow_block+0xee/0x190 [btrfs]
>   [245043.422979]  btrfs_search_slot+0x227/0xa60 [btrfs]
>   [245043.424279]  ? btrfs_update_inode_item+0x59/0x100 [btrfs]
>   [245043.425538]  ? iput+0x72/0x1e0
>   [245043.426798]  write_one_cache_group.isra.49+0x20/0x90 [btrfs]
>   [245043.428131]  btrfs_start_dirty_block_groups+0x102/0x420 [btrfs]
>   [245043.429419]  btrfs_commit_transaction+0x11b/0x880 [btrfs]
>   [245043.430712]  ? start_transaction+0x8e/0x410 [btrfs]
>   [245043.432006]  transaction_kthread+0x184/0x1a0 [btrfs]
>   [245043.433341]  kthread+0xf0/0x130
>   [245043.434628]  ? btrfs_cleanup_transaction+0x4e0/0x4e0 [btrfs]
>   [245043.435928]  ? kthread_create_worker_on_cpu+0x40/0x40
>   [245043.437236]  ret_from_fork+0x1f/0x30
>   [245043.441054] ---[ end trace 15abaa2aaf36827f ]---
> 
> This is because at write_one_cache_group() when we are COWing a leaf from
> the extent tree we end up allocating a new block group (chunk) and,
> because we have hit a threshold on the number of bytes reserved for system
> chunks, we attempt to finalize the creation of new block groups from the
> current transaction, by calling btrfs_create_pending_block_groups().
> However here we also need to modify the extent tree in order to insert
> a block group item, and if the location for this new block group item
> happens to be in the same leaf that we were COWing earlier, we deadlock
> since btrfs_search_slot() tries to write lock the extent buffer that we
> locked before at write_one_cache_group().
> 
> We have already hit similar cases in the past and commit d9a0540a79f8
> ("Btrfs: fix deadlock when finalizing block group creation") fixed some
> of those cases by delaying the creation of pending block groups at the
> known specific spots that could lead to a deadlock. This change reworks
> that commit to be more generic so that we don't have to add similar logic
> to every possible path that can lead to a deadlock. This is done by
> making __btrfs_cow_block() disallowing the creation of new block groups
> (setting the transaction's can_flush_pending_bgs to false) before it
> attempts to allocate a new extent buffer for either the extent, chunk or
> device trees, since those are the trees that pending block creation
> modifies. Once the new extent buffer is allocated, it allows creation of
> pending block groups to happen again.
> 
> This change depends on a recent patch from Josef which is not yet in
> Linus' tree, named "btrfs: make sure we create all new block groups" in
> order to avoid occasional warnings at btrfs_trans_release_chunk_metadata().

Thanks for mentioning it, the referenced patch has been in misc-next and
is scheduled for 4.20 so your patch will be added too.

Patch
diff mbox series

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index d436fb4c002e..089b46c4d97f 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -1050,9 +1050,26 @@  static noinline int __btrfs_cow_block(struct btrfs_trans_handle *trans,
 	if ((root->root_key.objectid == BTRFS_TREE_RELOC_OBJECTID) && parent)
 		parent_start = parent->start;
 
+	/*
+	 * If we are COWing a node/leaf from the extent, chunk or device trees,
+	 * make sure that we do not finish block group creation of pending block
+	 * groups. We do this to avoid a deadlock.
+	 * COWing can result in allocation of a new chunk, and flushing pending
+	 * block groups (btrfs_create_pending_block_groups()) can be triggered
+	 * when finishing allocation of a new chunk. Creation of a pending block
+	 * group modifies the extent, chunk and device trees, therefore we could
+	 * deadlock with ourselves since we are holding a lock on an extent
+	 * buffer that btrfs_create_pending_block_groups() may try to COW later.
+	 */
+	if (root == fs_info->extent_root ||
+	    root == fs_info->chunk_root ||
+	    root == fs_info->dev_root)
+		trans->can_flush_pending_bgs = false;
+
 	cow = btrfs_alloc_tree_block(trans, root, parent_start,
 			root->root_key.objectid, &disk_key, level,
 			search_start, empty_size);
+	trans->can_flush_pending_bgs = true;
 	if (IS_ERR(cow))
 		return PTR_ERR(cow);
 
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 2f9748599268..577878324799 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2911,7 +2911,6 @@  int btrfs_run_delayed_refs(struct btrfs_trans_handle *trans,
 	struct btrfs_delayed_ref_head *head;
 	int ret;
 	int run_all = count == (unsigned long)-1;
-	bool can_flush_pending_bgs = trans->can_flush_pending_bgs;
 
 	/* We'll clean this up in btrfs_cleanup_transaction */
 	if (trans->aborted)
@@ -2928,7 +2927,6 @@  int btrfs_run_delayed_refs(struct btrfs_trans_handle *trans,
 #ifdef SCRAMBLE_DELAYED_REFS
 	delayed_refs->run_delayed_start = find_middle(&delayed_refs->root);
 #endif
-	trans->can_flush_pending_bgs = false;
 	ret = __btrfs_run_delayed_refs(trans, count);
 	if (ret < 0) {
 		btrfs_abort_transaction(trans, ret);
@@ -2959,7 +2957,6 @@  int btrfs_run_delayed_refs(struct btrfs_trans_handle *trans,
 		goto again;
 	}
 out:
-	trans->can_flush_pending_bgs = can_flush_pending_bgs;
 	return 0;
 }
 
@@ -4554,11 +4551,9 @@  static int do_chunk_alloc(struct btrfs_trans_handle *trans, u64 flags,
 	 * the block groups that were made dirty during the lifetime of the
 	 * transaction.
 	 */
-	if (trans->can_flush_pending_bgs &&
-	    trans->chunk_bytes_reserved >= (u64)SZ_2M) {
+	if (trans->chunk_bytes_reserved >= (u64)SZ_2M)
 		btrfs_create_pending_block_groups(trans);
-		btrfs_trans_release_chunk_metadata(trans);
-	}
+
 	return ret;
 }
 
@@ -10079,9 +10074,10 @@  void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans)
 	struct btrfs_block_group_item item;
 	struct btrfs_key key;
 	int ret = 0;
-	bool can_flush_pending_bgs = trans->can_flush_pending_bgs;
 
-	trans->can_flush_pending_bgs = false;
+	if (!trans->can_flush_pending_bgs)
+		return;
+
 	while (!list_empty(&trans->new_bgs)) {
 		block_group = list_first_entry(&trans->new_bgs,
 					       struct btrfs_block_group_cache,
@@ -10106,7 +10102,7 @@  void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans)
 next:
 		list_del_init(&block_group->bg_list);
 	}
-	trans->can_flush_pending_bgs = can_flush_pending_bgs;
+	btrfs_trans_release_chunk_metadata(trans);
 }
 
 int btrfs_make_block_group(struct btrfs_trans_handle *trans, u64 bytes_used,