
[V2] Btrfs: find_free_extent: Do not erroneously skip LOOP_CACHING_WAIT state

Message ID 1446452986-6581-1-git-send-email-chandan@linux.vnet.ibm.com (mailing list archive)
State Accepted
Headers show

Commit Message

Chandan Rajendra Nov. 2, 2015, 8:29 a.m. UTC
When executing generic/001 in a loop on a ppc64 machine (with both sectorsize
and nodesize set to 64k), the following call trace is observed,

WARNING: at /root/repos/linux/fs/btrfs/locking.c:253
Modules linked in:
CPU: 2 PID: 8353 Comm: umount Not tainted 4.3.0-rc5-13676-ga5e681d #54
task: c0000000f2b1f560 ti: c0000000f6008000 task.ti: c0000000f6008000
NIP: c000000000520c88 LR: c0000000004a3b34 CTR: 0000000000000000
REGS: c0000000f600a820 TRAP: 0700   Not tainted  (4.3.0-rc5-13676-ga5e681d)
MSR: 8000000102029032 <SF,VEC,EE,ME,IR,DR,RI>  CR: 24444884  XER: 00000000
CFAR: c0000000004a3b30 SOFTE: 1
GPR00: c0000000004a3b34 c0000000f600aaa0 c00000000108ac00 c0000000f5a808c0
GPR04: 0000000000000000 c0000000f600ae60 0000000000000000 0000000000000005
GPR08: 00000000000020a1 0000000000000001 c0000000f2b1f560 0000000000000030
GPR12: 0000000084842882 c00000000fdc0900 c0000000f600ae60 c0000000f070b800
GPR16: 0000000000000000 c0000000f3c8a000 0000000000000000 0000000000000049
GPR20: 0000000000000001 0000000000000001 c0000000f5aa01f8 0000000000000000
GPR24: 0f83e0f83e0f83e1 c0000000f5a808c0 c0000000f3c8d000 c000000000000000
GPR28: c0000000f600ae74 0000000000000001 c0000000f3c8d000 c0000000f5a808c0
NIP [c000000000520c88] .btrfs_tree_lock+0x48/0x2a0
LR [c0000000004a3b34] .btrfs_lock_root_node+0x44/0x80
Call Trace:
[c0000000f600aaa0] [c0000000f600ab80] 0xc0000000f600ab80 (unreliable)
[c0000000f600ab80] [c0000000004a3b34] .btrfs_lock_root_node+0x44/0x80
[c0000000f600ac00] [c0000000004a99dc] .btrfs_search_slot+0xa8c/0xc00
[c0000000f600ad40] [c0000000004ab878] .btrfs_insert_empty_items+0x98/0x120
[c0000000f600adf0] [c00000000050da44] .btrfs_finish_chunk_alloc+0x1d4/0x620
[c0000000f600af20] [c0000000004be854] .btrfs_create_pending_block_groups+0x1d4/0x2c0
[c0000000f600b020] [c0000000004bf188] .do_chunk_alloc+0x3c8/0x420
[c0000000f600b100] [c0000000004c27cc] .find_free_extent+0xbfc/0x1030
[c0000000f600b260] [c0000000004c2ce8] .btrfs_reserve_extent+0xe8/0x250
[c0000000f600b330] [c0000000004c2f90] .btrfs_alloc_tree_block+0x140/0x590
[c0000000f600b440] [c0000000004a47b4] .__btrfs_cow_block+0x124/0x780
[c0000000f600b530] [c0000000004a4fc0] .btrfs_cow_block+0xf0/0x250
[c0000000f600b5e0] [c0000000004a917c] .btrfs_search_slot+0x22c/0xc00
[c0000000f600b720] [c00000000050aa40] .btrfs_remove_chunk+0x1b0/0x9f0
[c0000000f600b850] [c0000000004c4e04] .btrfs_delete_unused_bgs+0x434/0x570
[c0000000f600b950] [c0000000004d3cb8] .close_ctree+0x2e8/0x3b0
[c0000000f600ba20] [c00000000049d178] .btrfs_put_super+0x18/0x30
[c0000000f600ba90] [c000000000243cd4] .generic_shutdown_super+0xa4/0x1a0
[c0000000f600bb10] [c0000000002441d8] .kill_anon_super+0x18/0x30
[c0000000f600bb90] [c00000000049c898] .btrfs_kill_super+0x18/0xc0
[c0000000f600bc10] [c0000000002444f8] .deactivate_locked_super+0x98/0xe0
[c0000000f600bc90] [c000000000269f94] .cleanup_mnt+0x54/0xa0
[c0000000f600bd10] [c0000000000bd744] .task_work_run+0xc4/0x100
[c0000000f600bdb0] [c000000000016334] .do_notify_resume+0x74/0x80
[c0000000f600be30] [c0000000000098b8] .ret_from_except_lite+0x64/0x68
Instruction dump:
fba1ffe8 fbc1fff0 fbe1fff8 7c791b78 f8010010 f821ff21 e94d0290 81030040
812a04e8 7d094a78 7d290034 5529d97e <0b090000> 3b400000 3be30050 3bc3004c

The above call trace is also seen on x86_64, albeit very rarely, and only with
nodesize set to 64k and the nospace_cache mount option in use.

The reason for the above call trace is,
btrfs_remove_chunk
  check_system_chunk
    Allocate chunk if required
  For each physical stripe on underlying device,
    btrfs_free_dev_extent
      ...
      Take lock on Device tree's root node
      btrfs_cow_block("dev tree's root node");
        btrfs_reserve_extent
          find_free_extent
	    index = BTRFS_RAID_DUP;
	    have_caching_bg = false;

            When in the LOOP_CACHING_NOWAIT state, assume we find a block
            group which is being cached; hence have_caching_bg is set to
            true.

            When repeating the search for the next RAID index, we reset
            have_caching_bg to false.

Hence, right after completing the LOOP_CACHING_NOWAIT state, we incorrectly
skip the LOOP_CACHING_WAIT state and move to the LOOP_ALLOC_CHUNK state,
where we allocate a chunk and try to add entries corresponding to the
chunk's physical stripes to the device tree. In doing so, the task deadlocks,
waiting for the blocking lock on the root node of the device tree.

This commit fixes the issue by introducing a new local variable that
indicates whether a block group of any RAID type is being cached.

Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
---
Changelog:
v1->v2: Honor 80 column restriction.

 fs/btrfs/extent-tree.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

Comments

Josef Bacik Nov. 2, 2015, 3:46 p.m. UTC | #1
On 11/02/2015 03:29 AM, Chandan Rajendra wrote:
> When executing generic/001 in a loop on a ppc64 machine (with both sectorsize
> and nodesize set to 64k), the following call trace is observed,
>
> [...]

Ah good catch, thanks Chandan,

Reviewed-by: Josef Bacik <jbacik@fb.com>
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Chris Mason Nov. 2, 2015, 4:52 p.m. UTC | #2
On Mon, Nov 02, 2015 at 01:59:46PM +0530, Chandan Rajendra wrote:
> When executing generic/001 in a loop on a ppc64 machine (with both sectorsize
> and nodesize set to 64k), the following call trace is observed,

Thanks Chandan, I hit this same trace on x86-64 with 16K nodes.

-chris
Alex Lyakas Dec. 13, 2015, 10:18 a.m. UTC | #3
[Resending in plain text, apologies.]

Hi Chandan, Josef, Chris,

I am not sure I understand the fix to the problem.

It may happen that, while updating the device tree, we need to allocate a new
chunk via do_chunk_alloc() (while we are holding the device tree's root node
locked). This is a legitimate thing for find_free_extent() to do. And the
do_chunk_alloc() call may lead to a call to
btrfs_create_pending_block_groups(), which will try to update the device
tree. This may happen through the direct call to
btrfs_create_pending_block_groups() in do_chunk_alloc(), or perhaps through
the __btrfs_end_transaction() that find_free_extent() performs after it has
completed the chunk allocation (although in that case it will use the
transaction that already exists in current->journal_info).
So the deadlock may still happen?

Thanks,
 Alex.

> On Mon, Nov 2, 2015 at 6:52 PM, Chris Mason <clm@fb.com> wrote:
>>
>> On Mon, Nov 02, 2015 at 01:59:46PM +0530, Chandan Rajendra wrote:
>> > When executing generic/001 in a loop on a ppc64 machine (with both
>> > sectorsize
>> > and nodesize set to 64k), the following call trace is observed,
>>
>> Thanks Chandan, I hit this same trace on x86-64 with 16K nodes.
>>
>> -chris
Chandan Rajendra Dec. 14, 2015, 7:59 a.m. UTC | #4
On Sunday 13 Dec 2015 12:18:55 Alex Lyakas wrote:
> [Resending in plain text, apologies.]
>
> Hi Chandan, Josef, Chris,
>
> [...]
>
> So the deadlock still may happen?

Hello Alex,

The "global block reservation" (see btrfs_fs_info->global_block_rsv) aims to
solve this problem. I don't claim to have understood the behaviour of
global_block_rsv completely. However, the global block reservation makes sure
that we have enough free space reserved (see update_global_block_rsv()) for
future operations on,
- the Extent tree,
- the Checksum tree,
- the Device tree,
- the Tree root tree, and
- the Quota tree.

Tasks changing the device tree should get their space requirements satisfied
from the global block reservation. Hence such changes to the device tree
should not end up forcing find_free_extent() to allocate a new chunk.

Patch

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index f50c7c2..99a8e57 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -7029,6 +7029,7 @@  static noinline int find_free_extent(struct btrfs_root *orig_root,
 	bool failed_alloc = false;
 	bool use_cluster = true;
 	bool have_caching_bg = false;
+	bool orig_have_caching_bg = false;
 	bool full_search = false;
 
 	WARN_ON(num_bytes < root->sectorsize);
@@ -7378,6 +7379,10 @@  loop:
 	}
 	up_read(&space_info->groups_sem);
 
+	if ((loop == LOOP_CACHING_NOWAIT) && have_caching_bg
+		&& !orig_have_caching_bg)
+		orig_have_caching_bg = true;
+
 	if (!ins->objectid && loop >= LOOP_CACHING_WAIT && have_caching_bg)
 		goto search;
 
@@ -7400,7 +7405,7 @@  loop:
 			 * don't have any unached bgs and we've alrelady done a
 			 * full search through.
 			 */
-			if (have_caching_bg || !full_search)
+			if (orig_have_caching_bg || !full_search)
 				loop = LOOP_CACHING_WAIT;
 			else
 				loop = LOOP_ALLOC_CHUNK;