[v2] btrfs: Fix race condition between delayed refs and blockgroup removal

Message ID 1523433549-5248-1-git-send-email-nborisov@suse.com
State New

Commit Message

Nikolay Borisov April 11, 2018, 7:59 a.m. UTC
When the delayed refs for a head are all run, eventually
cleanup_ref_head is called, which (in the case of deletion) obtains a
reference to the relevant btrfs_space_info struct by querying the bg
for the range. This is problematic because when the last extent of a
bg is deleted, a race window opens between the removal of that bg and
the subsequent invocation of cleanup_ref_head. This can result in
cache being NULL and either a NULL pointer dereference or an assertion
failure.

	task: ffff8d04d31ed080 task.stack: ffff9e5dc10cc000
	RIP: 0010:assfail.constprop.78+0x18/0x1a [btrfs]
	RSP: 0018:ffff9e5dc10cfbe8 EFLAGS: 00010292
	RAX: 0000000000000044 RBX: 0000000000000000 RCX: 0000000000000000
	RDX: ffff8d04ffc1f868 RSI: ffff8d04ffc178c8 RDI: ffff8d04ffc178c8
	RBP: ffff8d04d29e5ea0 R08: 00000000000001f0 R09: 0000000000000001
	R10: ffff9e5dc0507d58 R11: 0000000000000001 R12: ffff8d04d29e5ea0
	R13: ffff8d04d29e5f08 R14: ffff8d04efe29b40 R15: ffff8d04efe203e0
	FS:  00007fbf58ead500(0000) GS:ffff8d04ffc00000(0000) knlGS:0000000000000000
	CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
	CR2: 00007fe6c6975648 CR3: 0000000013b2a000 CR4: 00000000000006f0
	DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
	DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
	Call Trace:
	 __btrfs_run_delayed_refs+0x10e7/0x12c0 [btrfs]
	 btrfs_run_delayed_refs+0x68/0x250 [btrfs]
	 btrfs_should_end_transaction+0x42/0x60 [btrfs]
	 btrfs_truncate_inode_items+0xaac/0xfc0 [btrfs]
	 btrfs_evict_inode+0x4c6/0x5c0 [btrfs]
	 evict+0xc6/0x190
	 do_unlinkat+0x19c/0x300
	 do_syscall_64+0x74/0x140
	 entry_SYSCALL_64_after_hwframe+0x3d/0xa2
	RIP: 0033:0x7fbf589c57a7

To fix this, introduce a new flag "is_system" to head_ref structs,
which is populated at insertion time. This allows the space_info
lookup to be decoupled from the possibly deleted bg.

Fixes: d7eae3403f46 ("Btrfs: rework delayed ref total_bytes_pinned accounting")
Suggested-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
---

v2: 
 * Collapsed the if/else logic in cleanup_ref_head

 fs/btrfs/delayed-ref.c | 19 ++++++++++++++-----
 fs/btrfs/delayed-ref.h |  1 +
 fs/btrfs/extent-tree.c | 18 +++++++++++++-----
 3 files changed, 28 insertions(+), 10 deletions(-)

Comments

Nikolay Borisov April 11, 2018, 8:08 a.m. UTC | #1
On 11.04.2018 10:59, Nikolay Borisov wrote:
> [snipped commit message and stack trace]

So I had a bit more written up about how I tested this code:

I did a full xfstests run with both the new and the old code, with the
following assert added:
  ASSERT(space_info == cache->space_info);

It didn't trigger, so I'm confident this change shouldn't introduce
any regressions.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Sterba April 16, 2018, 8:11 p.m. UTC | #2
On Wed, Apr 11, 2018 at 10:59:09AM +0300, Nikolay Borisov wrote:
> [snipped commit message, stack trace, and delayed-ref.{c,h} hunks]
> @@ -2616,13 +2616,21 @@ static int cleanup_ref_head(struct btrfs_trans_handle *trans,
>  	trace_run_delayed_ref_head(fs_info, head, 0);
>  
>  	if (head->total_ref_mod < 0) {
> -		struct btrfs_block_group_cache *cache;
> +		struct btrfs_space_info *space_info;
>  
> -		cache = btrfs_lookup_block_group(fs_info, head->bytenr);
> -		ASSERT(cache);
> -		percpu_counter_add(&cache->space_info->total_bytes_pinned,
> +		if (head->is_data) {
> +			space_info = __find_space_info(fs_info,
> +						       BTRFS_BLOCK_GROUP_DATA);
> +		} else if (head->is_system) {
> +			space_info = __find_space_info(fs_info,
> +						BTRFS_BLOCK_GROUP_SYSTEM);
> +		} else {
> +			space_info = __find_space_info(fs_info,
> +						BTRFS_BLOCK_GROUP_METADATA);
> +		}

Could you please convert that to something like this:

		u64 flags

		if (is_system)
			flags = BTRFS_BLOCK_GROUP_SYSTEM;
		else if (is_data)
			etc

		__find_space_info(fs_info, flags);

This is to avoid repeating the same function call.

> +		ASSERT(space_info);
> +		percpu_counter_add(&space_info->total_bytes_pinned,
>  				   -head->num_bytes);
> -		btrfs_put_block_group(cache);
>  
>  		if (head->is_data) {
>  			spin_lock(&delayed_refs->lock);
> -- 
> 2.7.4
> 

Patch

diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index 5ac2d7909782..20fa7b4c132d 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -550,8 +550,10 @@  add_delayed_ref_head(struct btrfs_fs_info *fs_info,
 		     struct btrfs_delayed_ref_head *head_ref,
 		     struct btrfs_qgroup_extent_record *qrecord,
 		     u64 bytenr, u64 num_bytes, u64 ref_root, u64 reserved,
-		     int action, int is_data, int *qrecord_inserted_ret,
+		     int action, int is_data, int is_system,
+		     int *qrecord_inserted_ret,
 		     int *old_ref_mod, int *new_ref_mod)
+
 {
 	struct btrfs_delayed_ref_head *existing;
 	struct btrfs_delayed_ref_root *delayed_refs;
@@ -595,6 +597,7 @@  add_delayed_ref_head(struct btrfs_fs_info *fs_info,
 	head_ref->ref_mod = count_mod;
 	head_ref->must_insert_reserved = must_insert_reserved;
 	head_ref->is_data = is_data;
+	head_ref->is_system = is_system;
 	head_ref->ref_tree = RB_ROOT;
 	INIT_LIST_HEAD(&head_ref->ref_add_list);
 	RB_CLEAR_NODE(&head_ref->href_node);
@@ -782,6 +785,7 @@  int btrfs_add_delayed_tree_ref(struct btrfs_fs_info *fs_info,
 	struct btrfs_delayed_ref_root *delayed_refs;
 	struct btrfs_qgroup_extent_record *record = NULL;
 	int qrecord_inserted;
+	int is_system = ref_root == BTRFS_CHUNK_TREE_OBJECTID;
 
 	BUG_ON(extent_op && extent_op->is_data);
 	ref = kmem_cache_alloc(btrfs_delayed_tree_ref_cachep, GFP_NOFS);
@@ -810,8 +814,8 @@  int btrfs_add_delayed_tree_ref(struct btrfs_fs_info *fs_info,
 	 */
 	head_ref = add_delayed_ref_head(fs_info, trans, head_ref, record,
 					bytenr, num_bytes, 0, 0, action, 0,
-					&qrecord_inserted, old_ref_mod,
-					new_ref_mod);
+					is_system, &qrecord_inserted,
+					old_ref_mod, new_ref_mod);
 
 	add_delayed_tree_ref(fs_info, trans, head_ref, &ref->node, bytenr,
 			     num_bytes, parent, ref_root, level, action);
@@ -878,7 +882,7 @@  int btrfs_add_delayed_data_ref(struct btrfs_fs_info *fs_info,
 	 */
 	head_ref = add_delayed_ref_head(fs_info, trans, head_ref, record,
 					bytenr, num_bytes, ref_root, reserved,
-					action, 1, &qrecord_inserted,
+					action, 1, 0,  &qrecord_inserted,
 					old_ref_mod, new_ref_mod);
 
 	add_delayed_data_ref(fs_info, trans, head_ref, &ref->node, bytenr,
@@ -908,9 +912,14 @@  int btrfs_add_delayed_extent_op(struct btrfs_fs_info *fs_info,
 	delayed_refs = &trans->transaction->delayed_refs;
 	spin_lock(&delayed_refs->lock);
 
+	/*
+	 * extent_ops just modify the flags of an extent and they don't result
+	 * in ref count changes, hence it's safe to pass false/0 for is_system
+	 * argument
+	 */
 	add_delayed_ref_head(fs_info, trans, head_ref, NULL, bytenr,
 			     num_bytes, 0, 0, BTRFS_UPDATE_DELAYED_HEAD,
-			     extent_op->is_data, NULL, NULL, NULL);
+			     extent_op->is_data, 0, NULL, NULL, NULL);
 
 	spin_unlock(&delayed_refs->lock);
 	return 0;
diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
index cdf4a94ce5c1..7e7bf4e66d48 100644
--- a/fs/btrfs/delayed-ref.h
+++ b/fs/btrfs/delayed-ref.h
@@ -139,6 +139,7 @@  struct btrfs_delayed_ref_head {
 	 */
 	unsigned int must_insert_reserved:1;
 	unsigned int is_data:1;
+	unsigned int is_system:1;
 	unsigned int processing:1;
 };
 
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 9a05e5b5089f..349b053680c8 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2616,13 +2616,21 @@  static int cleanup_ref_head(struct btrfs_trans_handle *trans,
 	trace_run_delayed_ref_head(fs_info, head, 0);
 
 	if (head->total_ref_mod < 0) {
-		struct btrfs_block_group_cache *cache;
+		struct btrfs_space_info *space_info;
 
-		cache = btrfs_lookup_block_group(fs_info, head->bytenr);
-		ASSERT(cache);
-		percpu_counter_add(&cache->space_info->total_bytes_pinned,
+		if (head->is_data) {
+			space_info = __find_space_info(fs_info,
+						       BTRFS_BLOCK_GROUP_DATA);
+		} else if (head->is_system) {
+			space_info = __find_space_info(fs_info,
+						BTRFS_BLOCK_GROUP_SYSTEM);
+		} else {
+			space_info = __find_space_info(fs_info,
+						BTRFS_BLOCK_GROUP_METADATA);
+		}
+		ASSERT(space_info);
+		percpu_counter_add(&space_info->total_bytes_pinned,
 				   -head->num_bytes);
-		btrfs_put_block_group(cache);
 
 		if (head->is_data) {
 			spin_lock(&delayed_refs->lock);