Btrfs: fix a deadlock on chunk mutex

Message ID: 20121218135242.GC2403@localhost.localdomain (mailing list archive)
State: New, archived

Commit Message

Josef Bacik Dec. 18, 2012, 1:52 p.m. UTC
On Wed, Dec 12, 2012 at 06:52:37PM -0700, Liu Bo wrote:
> An user reported that he has hit an annoying deadlock while playing with
> ceph based on btrfs.
> 
> Current updating device tree requires space from METADATA chunk,
> so we -may- need to do a recursive chunk allocation when adding/updating
> dev extent, that is where the deadlock comes from.
> 
> If we use SYSTEM metadata to update device tree, we can avoid the recursive
> stuff.
> 

This is going to cause us to allocate many more system chunks than we used to,
which could land us in trouble.  Instead let's just keep us from re-entering if
we're already allocating a chunk.  We do the chunk allocation when we don't have
enough space for a cluster, but we'll likely have plenty of space to make an
allocation.  Can you give this patch a try, Jim, and see if it fixes your problem?
Thanks,

Josef
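
A minimal user-space sketch of the re-entry guard described above, for
illustration only.  The names mirror the patch quoted later in the thread, but
struct trans_handle, alloc_chunk() and the two counters are hypothetical
stand-ins, not the kernel code: the nested call is refused with -ENOSPC while
the outer allocation is still in progress, and the outer call then completes.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct btrfs_trans_handle. */
struct trans_handle {
	bool allocating_chunk;   /* set while a chunk allocation is running */
	int chunks_allocated;    /* illustration only: completed allocations */
	int nested_refused;      /* illustration only: refused nested calls */
};

static int do_chunk_alloc(struct trans_handle *trans);

/*
 * Hypothetical stand-in for btrfs_alloc_chunk(): updating the device tree
 * may itself ask for more space, which re-enters do_chunk_alloc().
 */
static int alloc_chunk(struct trans_handle *trans)
{
	int ret;

	/* We are only ever called with the guard flag set. */
	assert(trans->allocating_chunk);

	ret = do_chunk_alloc(trans);     /* nested request */
	if (ret == -ENOSPC)
		trans->nested_refused++; /* carry on with existing space */

	trans->chunks_allocated++;
	return 0;
}

static int do_chunk_alloc(struct trans_handle *trans)
{
	int ret;

	/* Don't re-enter if we're already allocating a chunk. */
	if (trans->allocating_chunk)
		return -ENOSPC;

	trans->allocating_chunk = true;
	ret = alloc_chunk(trans);
	trans->allocating_chunk = false;
	return ret;
}
```

In this model a single top-level do_chunk_alloc() call yields exactly one
allocated chunk and one refused nested request, and the flag is clear again on
return.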


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Liu Bo Dec. 18, 2012, 2:47 p.m. UTC | #1
On Tue, Dec 18, 2012 at 08:52:42AM -0500, Josef Bacik wrote:
> On Wed, Dec 12, 2012 at 06:52:37PM -0700, Liu Bo wrote:
> > An user reported that he has hit an annoying deadlock while playing with
> > ceph based on btrfs.
> > 
> > Current updating device tree requires space from METADATA chunk,
> > so we -may- need to do a recursive chunk allocation when adding/updating
> > dev extent, that is where the deadlock comes from.
> > 
> > If we use SYSTEM metadata to update device tree, we can avoid the recursive
> > stuff.
> > 
> 
> This is going to cause us to allocate much more system chunks than we used to
> which could land us in trouble.  Instead let's just keep us from re-entering if
> we're already allocating a chunk.  We do the chunk allocation when we don't have
> enough space for a cluster, but we'll likely have plenty of space to make an
> allocation.  Can you give this patch a try Jim and see if it fixes your problem?
> Thanks,

From the stack info Jim gave, returning ENOSPC to the caller will end up
aborting the transaction and going read-only unless something else saves the
situation by allocating another METADATA chunk, and that is itself a recursive
allocation.

thanks,
liubo

> 
> Josef
> 
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index e152809..59df5e7 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -3564,6 +3564,10 @@ static int do_chunk_alloc(struct btrfs_trans_handle *trans,
>  	int wait_for_alloc = 0;
>  	int ret = 0;
>  
> +	/* Don't re-enter if we're already allocating a chunk */
> +	if (trans->allocating_chunk)
> +		return -ENOSPC;
> +
>  	space_info = __find_space_info(extent_root->fs_info, flags);
>  	if (!space_info) {
>  		ret = update_space_info(extent_root->fs_info, flags,
> @@ -3606,6 +3610,8 @@ again:
>  		goto again;
>  	}
>  
> +	trans->allocating_chunk = true;
> +
>  	/*
>  	 * If we have mixed data/metadata chunks we want to make sure we keep
>  	 * allocating mixed chunks instead of individual chunks.
> @@ -3632,6 +3638,7 @@ again:
>  	check_system_chunk(trans, extent_root, flags);
>  
>  	ret = btrfs_alloc_chunk(trans, extent_root, flags);
> +	trans->allocating_chunk = false;
>  	if (ret < 0 && ret != -ENOSPC)
>  		goto out;
>  
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index e6509b9..47ad8be 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -388,6 +388,7 @@ again:
>  	h->qgroup_reserved = qgroup_reserved;
>  	h->delayed_ref_elem.seq = 0;
>  	h->type = type;
> +	h->allocating_chunk = false;
>  	INIT_LIST_HEAD(&h->qgroup_ref_list);
>  	INIT_LIST_HEAD(&h->new_bgs);
>  
> diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
> index 0e8aa1e..69700f7 100644
> --- a/fs/btrfs/transaction.h
> +++ b/fs/btrfs/transaction.h
> @@ -68,6 +68,7 @@ struct btrfs_trans_handle {
>  	struct btrfs_block_rsv *orig_rsv;
>  	short aborted;
>  	short adding_csums;
> +	bool allocating_chunk;
>  	enum btrfs_trans_type type;
>  	/*
>  	 * this root is only needed to validate that the root passed to
Josef Bacik Dec. 18, 2012, 3:40 p.m. UTC | #2
On Tue, Dec 18, 2012 at 07:47:51AM -0700, Liu Bo wrote:
> On Tue, Dec 18, 2012 at 08:52:42AM -0500, Josef Bacik wrote:
> > On Wed, Dec 12, 2012 at 06:52:37PM -0700, Liu Bo wrote:
> > > An user reported that he has hit an annoying deadlock while playing with
> > > ceph based on btrfs.
> > > 
> > > Current updating device tree requires space from METADATA chunk,
> > > so we -may- need to do a recursive chunk allocation when adding/updating
> > > dev extent, that is where the deadlock comes from.
> > > 
> > > If we use SYSTEM metadata to update device tree, we can avoid the recursive
> > > stuff.
> > > 
> > 
> > This is going to cause us to allocate much more system chunks than we used to
> > which could land us in trouble.  Instead let's just keep us from re-entering if
> > we're already allocating a chunk.  We do the chunk allocation when we don't have
> > enough space for a cluster, but we'll likely have plenty of space to make an
> > allocation.  Can you give this patch a try Jim and see if it fixes your problem?
> > Thanks,
> 
> From the stack info Jim gave, returning ENOSPC to caller will end up with
> aborting to readonly if there is no others save the situation by 
> allocating another METADATA chunk, it is recursive allocation though.
> 

if (ret < 0 && ret != -ENOSPC)

With that check it shouldn't abort; it should just drop empty_size, stop trying
to allocate a cluster, and allocate only the blocks needed.  This applies only
to the recursive chunk allocation, so after this succeeds we'll have a new chunk
and the original allocation will be able to carry on.  Thanks,

Josef
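
The fallback described above can be sketched in user space as follows.  This
is a simplified model under stated assumptions, not the kernel's
find_free_extent(): struct space, grow() and reserve() are hypothetical names,
and empty_size stands for the extra room requested so that a cluster can be
set up.  When growing fails with -ENOSPC the reservation drops empty_size and
retries with only the bytes actually needed.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical model of the free space in already-allocated chunks. */
struct space {
	uint64_t free_bytes;
	int can_grow;   /* 0 models a refused (nested) chunk allocation */
};

/* Hypothetical stand-in for allocating a new chunk. */
static int grow(struct space *s)
{
	if (!s->can_grow)
		return -ENOSPC;
	s->free_bytes += 1024 * 1024;
	return 0;
}

/*
 * Reserve num_bytes, preferring to leave empty_size extra bytes free.
 * If no new chunk can be allocated, give up the extra room and retry.
 */
static int reserve(struct space *s, uint64_t num_bytes, uint64_t empty_size)
{
	assert(num_bytes > 0);

	for (;;) {
		int ret;

		if (s->free_bytes >= num_bytes + empty_size) {
			s->free_bytes -= num_bytes;
			return 0;
		}

		ret = grow(s);
		if (ret == 0)
			continue;        /* new chunk: try again */
		if (ret != -ENOSPC)
			return ret;      /* a real error: propagate it */
		if (empty_size == 0)
			return -ENOSPC;  /* nothing left to give up */

		empty_size = 0;          /* stop trying for a cluster */
	}
}
```

With 4096 bytes free and growth refused, a 4096-byte request with a nonzero
empty_size still succeeds in this model once the extra room is dropped; a
further request then fails with -ENOSPC.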
Jim Schutt Jan. 3, 2013, 6:44 p.m. UTC | #3
Hi Josef,

Thanks for the patch - sorry for the long delay in testing...


On 12/18/2012 06:52 AM, Josef Bacik wrote:
> On Wed, Dec 12, 2012 at 06:52:37PM -0700, Liu Bo wrote:
>> An user reported that he has hit an annoying deadlock while playing with
>> ceph based on btrfs.
>>
>> Current updating device tree requires space from METADATA chunk,
>> so we -may- need to do a recursive chunk allocation when adding/updating
>> dev extent, that is where the deadlock comes from.
>>
>> If we use SYSTEM metadata to update device tree, we can avoid the recursive
>> stuff.
>>
> 
> This is going to cause us to allocate much more system chunks than we used to
> which could land us in trouble.  Instead let's just keep us from re-entering if
> we're already allocating a chunk.  We do the chunk allocation when we don't have
> enough space for a cluster, but we'll likely have plenty of space to make an
> allocation.  Can you give this patch a try Jim and see if it fixes your problem?
> Thanks,
> 
> Josef
> 

With your patch applied to 3.7.1, I get the following on one
of my servers running Ceph OSDs.  The end effect is that some
of my ceph client writes hang. 

[ 1440.335752] ------------[ cut here ]------------
[ 1440.340602] WARNING: at fs/btrfs/super.c:246 __btrfs_abort_transaction+0x60/0x110 [btrfs]()
[ 1440.349117] Hardware name: X8DTH-i/6/iF/6F
[ 1440.353252] Modules linked in: btrfs zlib_deflate ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 dm_mirror dm_region_hash dm_log dm_round_robin dm_multipath scsi_dh vhost_net macvtap macvlan tun uinput sg joydev sd_mod iTCO_wdt iTCO_vendor_support hid_generic button ata_piix libata coretemp kvm crc32c_intel ghash_clmulni_intel aesni_intel ablk_helper cryptd lrw aes_x86_64 xts gf128mul microcode mpt2sas scsi_transport_sas raid_class scsi_mod serio_raw pcspkr mlx4_ib ib_sa ib_mad ib_core mlx4_en mlx4_core cxgb4 i2c_i801 i2c_core lpc_ich mfd_core ehci_hcd uhci_hcd ioatdma i7core_edac dm_mod edac_core nfsv4 auth_rpcgss nfsv3 nfs_acl nfsv2 nfs lockd sunrpc fscache broadcom tg3 hwmon bnx2 igb dca e1000
[ 1440.419398] Pid: 48686, comm: ceph-osd Not tainted 3.7.1-00006-gc794580 #484
[ 1440.426614] Call Trace:
[ 1440.429083]  [<ffffffff8103fed4>] warn_slowpath_common+0x94/0xc0
[ 1440.435110]  [<ffffffff8103ffb6>] warn_slowpath_fmt+0x46/0x50
[ 1440.440894]  [<ffffffffa05425c0>] __btrfs_abort_transaction+0x60/0x110 [btrfs]
[ 1440.448135]  [<ffffffffa059513d>] __btrfs_alloc_chunk+0x6cd/0x750 [btrfs]
[ 1440.454941]  [<ffffffffa059521e>] btrfs_alloc_chunk+0x5e/0x90 [btrfs]
[ 1440.461382]  [<ffffffffa05543a1>] ? check_system_chunk+0x71/0x130 [btrfs]
[ 1440.468188]  [<ffffffffa055474c>] do_chunk_alloc+0x2ec/0x370 [btrfs]
[ 1440.474562]  [<ffffffffa05509e9>] ? btrfs_reduce_alloc_profile+0xa9/0x120 [btrfs]
[ 1440.482050]  [<ffffffffa055839c>] btrfs_check_data_free_space+0x13c/0x2b0 [btrfs]
[ 1440.489558]  [<ffffffffa0559f40>] btrfs_delalloc_reserve_space+0x20/0x60 [btrfs]
[ 1440.497013]  [<ffffffffa057e31e>] __btrfs_buffered_write+0x15e/0x350 [btrfs]
[ 1440.504095]  [<ffffffffa057e849>] btrfs_file_aio_write+0x209/0x320 [btrfs]
[ 1440.511000]  [<ffffffffa057e640>] ? __btrfs_direct_write+0x130/0x130 [btrfs]
[ 1440.518062]  [<ffffffff81164ef4>] do_sync_readv_writev+0x94/0xe0
[ 1440.524105]  [<ffffffff81165f03>] do_readv_writev+0xe3/0x1e0
[ 1440.529792]  [<ffffffff81182ff2>] ? fget_light+0x122/0x170
[ 1440.535275]  [<ffffffff81166046>] vfs_writev+0x46/0x60
[ 1440.540412]  [<ffffffff8116617f>] sys_writev+0x5f/0xc0
[ 1440.545547]  [<ffffffff81264b3e>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[ 1440.551987]  [<ffffffff814b7102>] system_call_fastpath+0x16/0x1b
[ 1440.558016] ---[ end trace 764e83a458dabca6 ]---
[ 1440.562662] BTRFS warning (device dm-32): __btrfs_alloc_chunk:3488: Aborting unused transaction(error 28).
[ 1440.595987] BTRFS warning (device dm-32): find_free_extent:5871: Aborting unused transaction(Object already exists).
[ 1440.606542] BUG: unable to handle kernel NULL pointer dereference at           (null)
[ 1440.614382] IP: [<ffffffffa0584e5e>] map_private_extent_buffer+0xe/0xf0 [btrfs]
[ 1440.621704] PGD 6138e8067 PUD 56749f067 PMD 0 
[ 1440.626190] Oops: 0000 [#1] SMP 
[ 1440.629442] Modules linked in: btrfs zlib_deflate ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 dm_mirror dm_region_hash dm_log dm_round_robin dm_multipath scsi_dh vhost_net macvtap macvlan tun uinput sg joydev sd_mod iTCO_wdt iTCO_vendor_support hid_generic button ata_piix libata coretemp kvm crc32c_intel ghash_clmulni_intel aesni_intel ablk_helper cryptd lrw aes_x86_64 xts gf128mul microcode mpt2sas scsi_transport_sas raid_class scsi_mod serio_raw pcspkr mlx4_ib ib_sa ib_mad ib_core mlx4_en mlx4_core cxgb4 i2c_i801 i2c_core lpc_ich mfd_core ehci_hcd uhci_hcd ioatdma i7core_edac dm_mod edac_core nfsv4 auth_rpcgss nfsv3 nfs_acl nfsv2 nfs lockd sunrpc fscache broadcom tg3 hwmon bnx2 igb dca e1000
[ 1440.694855] CPU 16 
[ 1440.696784] Pid: 48687, comm: ceph-osd Tainted: G        W    3.7.1-00006-gc794580 #484 Supermicro X8DTH-i/6/iF/6F/X8DTH
[ 1440.707803] RIP: 0010:[<ffffffffa0584e5e>]  [<ffffffffa0584e5e>] map_private_extent_buffer+0xe/0xf0 [btrfs]
[ 1440.717544] RSP: 0018:ffff880b740db9f8  EFLAGS: 00010292
[ 1440.722841] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff880b740dba28
[ 1440.729947] RDX: 0000000000000004 RSI: 0000000000000076 RDI: 0000000000000000
[ 1440.737055] RBP: ffff880b740dba08 R08: ffff880b740dba20 R09: ffff880b740dba18
[ 1440.744167] R10: ffff88092bba8000 R11: ffff880a4138c320 R12: 0000000000000000
[ 1440.751280] R13: 0000000000000065 R14: 0000000000000011 R15: 0000000000000076
[ 1440.758395] FS:  00007fffeb4c3700(0000) GS:ffff880627d40000(0000) knlGS:0000000000000000
[ 1440.766460] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1440.772188] CR2: 0000000000000000 CR3: 00000004bd2a4000 CR4: 00000000000007e0
[ 1440.779303] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1440.786416] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 1440.793523] Process ceph-osd (pid: 48687, threadinfo ffff880b740da000, task ffff8808f801bec0)
[ 1440.802018] Stack:
[ 1440.804030]  ffff880b740dbb98 0000000000000000 ffff880b740dba68 ffffffffa0581e3c
[ 1440.811464]  ffff880977dbd030 ffff880c00000002 ffff8808f801c5f0 0000000000000053
[ 1440.818897]  ffff880b740dbae4 ffff880612084c60 0000000000000000 ffff880612084c60
[ 1440.826330] Call Trace:
[ 1440.828800]  [<ffffffffa0581e3c>] btrfs_get_token_32+0x8c/0xf0 [btrfs]
[ 1440.835327]  [<ffffffffa056042d>] btrfs_match_dir_item_name+0x4d/0x140 [btrfs]
[ 1440.842545]  [<ffffffffa0560919>] insert_with_overflow+0x59/0x120 [btrfs]
[ 1440.849315]  [<ffffffffa0560ca6>] btrfs_insert_xattr_item+0xb6/0x1d0 [btrfs]
[ 1440.856343]  [<ffffffffa056d279>] ? join_transaction+0x29/0x370 [btrfs]
[ 1440.862945]  [<ffffffffa056d30f>] ? join_transaction+0xbf/0x370 [btrfs]
[ 1440.869536]  [<ffffffff81159ac3>] ? kmem_cache_alloc+0xd3/0x170
[ 1440.875450]  [<ffffffffa0582b3a>] do_setxattr+0x17a/0x240 [btrfs]
[ 1440.881534]  [<ffffffffa0582c8b>] __btrfs_setxattr+0x8b/0x110 [btrfs]
[ 1440.887965]  [<ffffffffa0582f27>] btrfs_setxattr+0xa7/0xc0 [btrfs]
[ 1440.894130]  [<ffffffff8118a19b>] __vfs_setxattr_noperm+0x7b/0x150
[ 1440.900287]  [<ffffffff8118a2fe>] vfs_setxattr+0x8e/0xc0
[ 1440.905591]  [<ffffffff8118a4e5>] setxattr+0x1b5/0x230
[ 1440.910713]  [<ffffffff81167347>] ? __sb_start_write+0x1b7/0x200
[ 1440.916702]  [<ffffffff81185378>] ? mnt_want_write_file+0x28/0x60
[ 1440.922778]  [<ffffffff81182f40>] ? fget_light+0x70/0x170
[ 1440.928168]  [<ffffffff81185378>] ? mnt_want_write_file+0x28/0x60
[ 1440.934242]  [<ffffffff81182ff2>] ? fget_light+0x122/0x170
[ 1440.939713]  [<ffffffff8118a5ec>] sys_fsetxattr+0x8c/0xe0
[ 1440.945097]  [<ffffffff814b7102>] system_call_fastpath+0x16/0x1b
[ 1440.951083] Code: ef 88 00 00 00 48 89 e5 e8 a0 ff ff ff c9 c3 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 53 48 83 ec 08 66 66 66 66 90 <4c> 8b 17 41 81 e2 ff 0f 00 00 4a 8d 04 16 4c 8d 5c 10 ff 48 89 
[ 1440.971006] RIP  [<ffffffffa0584e5e>] map_private_extent_buffer+0xe/0xf0 [btrfs]
[ 1440.978415]  RSP <ffff880b740db9f8>
[ 1440.981896] CR2: 0000000000000000
[ 1440.985557] ---[ end trace 764e83a458dabca7 ]---
[ 1440.990075] divide error: 0000 [#2] SMP 
[ 1440.990133] Modules linked in: btrfs zlib_deflate ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 dm_mirror dm_region_hash dm_log dm_round_robin dm_multipath scsi_dh vhost_net macvtap macvlan tun uinput sg joydev sd_mod iTCO_wdt iTCO_vendor_support hid_generic button ata_piix libata coretemp kvm crc32c_intel ghash_clmulni_intel aesni_intel ablk_helper cryptd lrw aes_x86_64 xts gf128mul microcode mpt2sas scsi_transport_sas raid_class scsi_mod serio_raw pcspkr mlx4_ib ib_sa ib_mad ib_core mlx4_en mlx4_core cxgb4 i2c_i801 i2c_core lpc_ich mfd_core ehci_hcd uhci_hcd ioatdma i7core_edac dm_mod edac_core nfsv4 auth_rpcgss nfsv3 nfs_acl nfsv2 nfs lockd sunrpc fscache broadcom tg3 hwmon bnx2 igb dca e1000
[ 1440.990139] CPU 20 
[ 1440.990139] Pid: 48693, comm: ceph-osd Tainted: G      D W    3.7.1-00006-gc794580 #484 Supermicro X8DTH-i/6/iF/6F/X8DTH
[ 1440.990163] RIP: 0010:[<ffffffffa059429d>]  [<ffffffffa059429d>] __btrfs_map_block+0xcd/0x670 [btrfs]
[ 1440.990187] RSP: 0018:ffff880b740f5ad8  EFLAGS: 00010246
[ 1440.990194] RAX: 0000000000800000 RBX: 0000000000800000 RCX: 0000000040000000
[ 1440.990195] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[ 1440.990195] RBP: ffff880b740f5b68 R08: 0000000000000000 R09: 0000000000000000
[ 1440.990196] R10: ffff88062311f6e8 R11: 0000000000000000 R12: ffff880b740f5b90
[ 1440.990200] R13: ffff8805054971c0 R14: ffff880c182f4298 R15: ffff880b740f5e68
[ 1440.990201] FS:  00007fffe6cba700(0000) GS:ffff880c3fd00000(0000) knlGS:0000000000000000
[ 1440.990202] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 1440.990203] CR2: ffffffffff600400 CR3: 00000004bd2a4000 CR4: 00000000000007e0
[ 1440.990207] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1440.990207] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 1440.990209] Process ceph-osd (pid: 48693, threadinfo ffff880b740f4000, task ffff8809877d8000)
[ 1440.990209] Stack:
[ 1440.990217]  ffff88092bba8000 ffff880156a22e00 ffff88062311f6e8 ffff880156a23388
[ 1440.990225]  0000000000000000 ffffffff8111365d 0000000000000000 0000000000000000
[ 1440.990230]  00000000740f5b98 0000000000000046 0000000000000000 ffffffff8111365d
[ 1440.990230] Call Trace:
[ 1440.990236]  [<ffffffff8111365d>] ? test_set_page_writeback+0x6d/0x170
[ 1440.990291]  [<ffffffff8111365d>] ? test_set_page_writeback+0x6d/0x170
[ 1440.990307]  [<ffffffffa059484e>] btrfs_map_block+0xe/0x10 [btrfs]
[ 1440.990349]  [<ffffffffa0571307>] btrfs_merge_bio_hook+0x57/0x80 [btrfs]
[ 1440.990458]  [<ffffffffa0585ba3>] submit_extent_page+0xc3/0x1d0 [btrfs]
[ 1440.990487]  [<ffffffff8110a2f0>] ? find_get_pages+0x1c0/0x1c0
[ 1440.990525]  [<ffffffffa058ba7f>] __extent_writepage+0x69f/0x760 [btrfs]
[ 1440.990571]  [<ffffffffa0585ed0>] ? extent_io_tree_init+0x90/0x90 [btrfs]
[ 1440.990680]  [<ffffffffa058bf52>] extent_write_cache_pages.clone.3+0x242/0x3d0 [btrfs]
[ 1440.990733]  [<ffffffffa058c12f>] extent_writepages+0x4f/0x70 [btrfs]
[ 1440.990784]  [<ffffffffa0577630>] ? btrfs_lookup+0x70/0x70 [btrfs]
[ 1440.990848]  [<ffffffff81182ff2>] ? fget_light+0x122/0x170
[ 1440.990870]  [<ffffffffa0571df7>] btrfs_writepages+0x27/0x30 [btrfs]
[ 1440.990886]  [<ffffffff81115423>] do_writepages+0x23/0x40
[ 1440.990889]  [<ffffffff811099ce>] __filemap_fdatawrite_range+0x4e/0x50
[ 1440.990920]  [<ffffffff81109c83>] filemap_fdatawrite_range+0x13/0x20
[ 1440.990982]  [<ffffffff81195589>] sys_sync_file_range+0x109/0x170
[ 1440.991022]  [<ffffffff814b7102>] system_call_fastpath+0x16/0x1b
[ 1440.991149] Code: 66 0f 1f 44 00 00 4d 8b 6a 60 48 29 c3 8b 45 c4 41 39 45 18 b8 00 00 00 00 0f 4d 45 c4 31 d2 89 45 c4 49 63 75 10 48 89 d8 89 f7 <48> f7 f7 49 89 c6 48 89 45 c8 4c 0f af f6 4c 39 f3 73 10 0f 0b 
[ 1440.991174] RIP  [<ffffffffa059429d>] __btrfs_map_block+0xcd/0x670 [btrfs]
[ 1440.991203]  RSP <ffff880b740f5ad8>
[ 1440.991206] ---[ end trace 764e83a458dabca8 ]---
[ 1451.948155] BUG: unable to handle kernel NULL pointer dereference at 00000000000000a9
[ 1451.956010] IP: [<ffffffffa05949d4>] btrfs_map_bio+0x184/0x220 [btrfs]
[ 1451.962580] PGD 0 
[ 1451.964620] Oops: 0000 [#3] SMP 
[ 1451.967887] Modules linked in: btrfs zlib_deflate ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 dm_mirror dm_region_hash dm_log dm_round_robin dm_multipath scsi_dh vhost_net macvtap macvlan tun uinput sg joydev sd_mod iTCO_wdt iTCO_vendor_support hid_generic button ata_piix libata coretemp kvm crc32c_intel ghash_clmulni_intel aesni_intel ablk_helper cryptd lrw aes_x86_64 xts gf128mul microcode mpt2sas scsi_transport_sas raid_class scsi_mod serio_raw pcspkr mlx4_ib ib_sa ib_mad ib_core mlx4_en mlx4_core cxgb4 i2c_i801 i2c_core lpc_ich mfd_core ehci_hcd uhci_hcd ioatdma i7core_edac dm_mod edac_core nfsv4 auth_rpcgss nfsv3 nfs_acl nfsv2 nfs lockd sunrpc fscache broadcom tg3 hwmon bnx2 igb dca e1000
[ 1452.033336] CPU 5 
[ 1452.035177] Pid: 25627, comm: btrfs-worker-1 Tainted: G      D W    3.7.1-00006-gc794580 #484 Supermicro X8DTH-i/6/iF/6F/X8DTH
[ 1452.046715] RIP: 0010:[<ffffffffa05949d4>]  [<ffffffffa05949d4>] btrfs_map_bio+0x184/0x220 [btrfs]
[ 1452.055688] RSP: 0018:ffff88050e967cc8  EFLAGS: 00010202
[ 1452.060987] RAX: 000000000000000c RBX: ffff880959c9ea80 RCX: ffff880959c9ea80
[ 1452.068100] RDX: ffff88060bd03060 RSI: 0000000000000001 RDI: ffff88062311f6e8
[ 1452.075212] RBP: ffff88050e967d28 R08: ffff88060bd03060 R09: 0000000000000009
[ 1452.082327] R10: ffff88062311f6e8 R11: 0000000000000000 R12: 0000000000000001
[ 1452.089442] R13: 0000000000000000 R14: 0000000000000004 R15: ffff88092bba8000
[ 1452.096554] FS:  0000000000000000(0000) GS:ffff880627ca0000(0000) knlGS:0000000000000000
[ 1452.104621] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 1452.110352] CR2: 00000000000000a9 CR3: 0000000001a0b000 CR4: 00000000000007e0
[ 1452.117466] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1452.124577] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 1452.131693] Process btrfs-worker-1 (pid: 25627, threadinfo ffff88050e966000, task ffff880612418000)
[ 1452.140707] Stack:
[ 1452.142720]  0000000000000000 000000000040e010 00000001182f5470 0000000100000000
[ 1452.150160]  ffff88060bd03060 000000003f7fe000 ffff88050e967d38 ffff880959c9e7c8
[ 1452.157601]  ffff880959c9e780 ffff880c182f5470 ffff880c182f5428 ffff880c182f5418
[ 1452.165061] Call Trace:
[ 1452.167540]  [<ffffffffa0570bab>] __btrfs_submit_bio_done+0x1b/0x20 [btrfs]
[ 1452.174501]  [<ffffffffa0566a41>] run_one_async_done+0xc1/0xd0 [btrfs]
[ 1452.181027]  [<ffffffffa0596a93>] run_ordered_completions+0x83/0xd0 [btrfs]
[ 1452.187991]  [<ffffffffa05975c8>] worker_loop+0x1b8/0x410 [btrfs]
[ 1452.194087]  [<ffffffffa0597410>] ? check_pending_worker_creates+0xe0/0xe0 [btrfs]
[ 1452.201639]  [<ffffffff81066df1>] kthread+0xe1/0xf0
[ 1452.206528]  [<ffffffff81066d10>] ? __init_kthread_worker+0x70/0x70
[ 1452.212779]  [<ffffffff814b705c>] ret_from_fork+0x7c/0xb0
[ 1452.218167]  [<ffffffff81066d10>] ? __init_kthread_worker+0x70/0x70
[ 1452.224411] Code: 48 89 51 48 48 8d 14 40 48 8b 45 c0 48 c1 e2 03 48 01 d0 48 8b 40 38 48 c1 e8 09 48 89 01 48 03 55 c0 48 8b 72 30 48 85 f6 74 4c <48> 8b 86 a8 00 00 00 48 85 c0 74 40 41 83 fc 01 75 0a 8b 56 60 
[ 1452.244357] RIP  [<ffffffffa05949d4>] btrfs_map_bio+0x184/0x220 [btrfs]
[ 1452.250995]  RSP <ffff88050e967cc8>
[ 1452.254485] CR2: 00000000000000a9
[ 1452.258149] ---[ end trace 764e83a458dabca9 ]---


Josef Bacik Jan. 28, 2013, 9:23 p.m. UTC | #4
On Thu, Jan 03, 2013 at 11:44:46AM -0700, Jim Schutt wrote:
> Hi Josef,
> 
> Thanks for the patch - sorry for the long delay in testing...
> 

Jim,

I've been trying to reason out how this happens.  Could you do a btrfs fi df on
the filesystem that's giving you trouble so I can see if what I think is
happening is what's actually happening?  Thanks,

Josef
Jim Schutt Jan. 28, 2013, 9:58 p.m. UTC | #5
On 01/28/2013 02:23 PM, Josef Bacik wrote:
> On Thu, Jan 03, 2013 at 11:44:46AM -0700, Jim Schutt wrote:
>> Hi Josef,
>>
>> Thanks for the patch - sorry for the long delay in testing...
>>
> 
> Jim,
> 
> I've been trying to reason out how this happens, could you do a btrfs fi df on
> the filesystem thats giving you trouble so I can see if what I think is
> happening is what's actually happening.  Thanks,

Sure - it'll take me a bit to set the test up again.

-- Jim

> 
> Josef
> 
> 


Liu Bo Jan. 29, 2013, 2:30 a.m. UTC | #6
On Mon, Jan 28, 2013 at 04:23:31PM -0500, Josef Bacik wrote:
> On Thu, Jan 03, 2013 at 11:44:46AM -0700, Jim Schutt wrote:
> > Hi Josef,
> > 
> > Thanks for the patch - sorry for the long delay in testing...
> > 
> 
> Jim,
> 
> I've been trying to reason out how this happens, could you do a btrfs fi df on
> the filesystem thats giving you trouble so I can see if what I think is
> happening is what's actually happening.  Thanks,

Josef,

A quick reproducer here: running xfstests 251 with autodefrag,compress=zlib

thanks,
liubo
Josef Bacik Jan. 29, 2013, 1:47 p.m. UTC | #7
On Mon, Jan 28, 2013 at 07:30:09PM -0700, Liu Bo wrote:
> On Mon, Jan 28, 2013 at 04:23:31PM -0500, Josef Bacik wrote:
> > On Thu, Jan 03, 2013 at 11:44:46AM -0700, Jim Schutt wrote:
> > > Hi Josef,
> > > 
> > > Thanks for the patch - sorry for the long delay in testing...
> > > 
> > 
> > Jim,
> > 
> > I've been trying to reason out how this happens, could you do a btrfs fi df on
> > the filesystem thats giving you trouble so I can see if what I think is
> > happening is what's actually happening.  Thanks,
> 
> Josef,
> 
> A quick reproducer here: running xfstests 251 with autodefrag,compress=zlib
> 


251      [not run] FSTRIM is not supported

Are you sure it's 251?  Thanks,

Josef
Josef Bacik Jan. 29, 2013, 1:50 p.m. UTC | #8
On Tue, Jan 29, 2013 at 08:47:30AM -0500, Josef Bacik wrote:
> On Mon, Jan 28, 2013 at 07:30:09PM -0700, Liu Bo wrote:
> > On Mon, Jan 28, 2013 at 04:23:31PM -0500, Josef Bacik wrote:
> > > On Thu, Jan 03, 2013 at 11:44:46AM -0700, Jim Schutt wrote:
> > > > Hi Josef,
> > > > 
> > > > Thanks for the patch - sorry for the long delay in testing...
> > > > 
> > > 
> > > Jim,
> > > 
> > > I've been trying to reason out how this happens, could you do a btrfs fi df on
> > > the filesystem thats giving you trouble so I can see if what I think is
> > > happening is what's actually happening.  Thanks,
> > 
> > Josef,
> > 
> > A quick reproducer here: running xfstests 251 with autodefrag,compress=zlib
> > 
> 
> 
> 251      [not run] FSTRIM is not supported
> 
> Are you sure its 251?  Thanks,

Sorry it's early, I need a device that does trim.  /me waits for his fusion card
to get back from the shop,

Josef
David Sterba Jan. 29, 2013, 4:43 p.m. UTC | #9
On Tue, Jan 29, 2013 at 08:50:34AM -0500, Josef Bacik wrote:
> On Tue, Jan 29, 2013 at 08:47:30AM -0500, Josef Bacik wrote:
> > 251      [not run] FSTRIM is not supported
> > 
> > Are you sure its 251?  Thanks,
> 
> Sorry it's early, I need a device that does trim.  /me waits for his fusion card
> to get back from the shop,

You can use a scsi_debug device with

parm:           lbpu:enable LBP, support UNMAP command (def=0) (int)

david
David Sterba Jan. 29, 2013, 4:52 p.m. UTC | #10
On Tue, Jan 29, 2013 at 05:43:31PM +0100, David Sterba wrote:
> On Tue, Jan 29, 2013 at 08:50:34AM -0500, Josef Bacik wrote:
> You can use scsi_debug device with
> 
> parm:           lbpu:enable LBP, support UNMAP command (def=0) (int)

Also, a loop device whose backing file lives on a filesystem with hole punch
support understands TRIM.

david
Jim Schutt Jan. 29, 2013, 6:41 p.m. UTC | #11
On 01/28/2013 02:23 PM, Josef Bacik wrote:
> On Thu, Jan 03, 2013 at 11:44:46AM -0700, Jim Schutt wrote:
>> Hi Josef,
>>
>> Thanks for the patch - sorry for the long delay in testing...
>>
> 
> Jim,
> 
> I've been trying to reason out how this happens, could you do a btrfs fi df on
> the filesystem thats giving you trouble so I can see if what I think is
> happening is what's actually happening.  Thanks,

Here's an example, using a slightly different kernel than
my previous report.  It's your btrfs-next master branch
(commit 8f139e59d5 "Btrfs: use bit operation for ->fs_state")
with ceph 3.8 for-linus (commit 0fa6ebc600 from linus' tree).


Here I'm finding the file system in question:

# ls -l /dev/mapper | grep dm-93
lrwxrwxrwx 1 root root       8 Jan 29 11:13 cs53s19p2 -> ../dm-93

# df -h | grep -A 1 cs53s19p2
/dev/mapper/cs53s19p2
                      896G  1.1G  896G   1% /ram/mnt/ceph/data.osd.522


Here's the info you asked for:

# btrfs fi df /ram/mnt/ceph/data.osd.522
Data: total=2.01GB, used=1.00GB
System: total=4.00MB, used=64.00KB
Metadata: total=8.00MB, used=7.56MB


And here's the backtrace that had trouble on dm-93.
It's a little different to my previous report:

[  705.496463] ------------[ cut here ]------------
[  705.501123] WARNING: at fs/btrfs/super.c:256 __btrfs_abort_transaction+0x60/0x110 [btrfs]()
[  705.509751] Hardware name: X8DTH-i/6/iF/6F
[  705.513862] Modules linked in: btrfs zlib_deflate ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 dm_mirror dm_region_hash dm_log dm_round_robin dm_multipath scsi_dh vhost_net macvtap macvlan tun uinput sg joydev sd_mod hid_generic iTCO_wdt iTCO_vendor_support coretemp kvm crc32c_intel ghash_clmulni_intel aesni_intel ablk_helper cryptd lrw aes_x86_64 xts gf128mul microcode serio_raw pcspkr mlx4_ib ib_sa ib_mad ib_core mlx4_en mlx4_core ata_piix libata mpt2sas scsi_transport_sas raid_class scsi_mod cxgb4 i2c_i801 i2c_core button lpc_ich mfd_core ehci_hcd uhci_hcd i7core_edac edac_core dm_mod ioatdma nfsv4 auth_rpcgss nfsv3 nfs_acl nfsv2 nfs lockd sunrpc fscache broadcom tg3 hwmon bnx2 igb dca e1000
[  705.580232] Pid: 33025, comm: ceph-osd Not tainted 3.7.0-00269-gd9acbfd #492
[  705.587488] Call Trace:
[  705.589957]  [<ffffffff8103ff04>] warn_slowpath_common+0x94/0xc0
[  705.596108]  [<ffffffffa055331a>] ? btrfs_free_path+0x2a/0x40 [btrfs]
[  705.602685]  [<ffffffff8103ffe6>] warn_slowpath_fmt+0x46/0x50
[  705.608563]  [<ffffffffa054c730>] __btrfs_abort_transaction+0x60/0x110 [btrfs]
[  705.615994]  [<ffffffffa05a2058>] __btrfs_alloc_chunk+0x678/0x710 [btrfs]
[  705.622945]  [<ffffffffa05a214e>] btrfs_alloc_chunk+0x5e/0x90 [btrfs]
[  705.629635]  [<ffffffffa055edb1>] ? check_system_chunk+0x71/0x130 [btrfs]
[  705.637079]  [<ffffffffa055f15c>] do_chunk_alloc+0x2ec/0x370 [btrfs]
[  705.643451]  [<ffffffffa055b199>] ? btrfs_reduce_alloc_profile+0xa9/0x120 [btrfs]
[  705.650951]  [<ffffffffa0561d1c>] btrfs_check_data_free_space+0x13c/0x2b0 [btrfs]
[  705.658446]  [<ffffffffa0564a70>] btrfs_delalloc_reserve_space+0x20/0x60 [btrfs]
[  705.665882]  [<ffffffffa058980e>] __btrfs_buffered_write+0x15e/0x340 [btrfs]
[  705.672952]  [<ffffffffa0589e29>] btrfs_file_aio_write+0x309/0x450 [btrfs]
[  705.679889]  [<ffffffffa0589b20>] ? __btrfs_direct_write+0x130/0x130 [btrfs]
[  705.686934]  [<ffffffff811626f4>] do_sync_readv_writev+0x94/0xe0
[  705.692942]  [<ffffffff811637b3>] do_readv_writev+0xe3/0x1e0
[  705.698604]  [<ffffffff81180c42>] ? fget_light+0x122/0x170
[  705.704093]  [<ffffffff811638f6>] vfs_writev+0x46/0x60
[  705.709239]  [<ffffffff81163a2f>] sys_writev+0x5f/0xc0
[  705.714388]  [<ffffffff812637ee>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[  705.720827]  [<ffffffff814b7882>] system_call_fastpath+0x16/0x1b
[  705.726829] ---[ end trace 6e889d6d939ca116 ]---
[  705.731459] BTRFS warning (device dm-93): __btrfs_alloc_chunk:3787: Aborting unused transaction(error 28).
[  705.741187] btrfs: mapping failed logical 1099431936 bio len 524288 len 65536
[  705.741192] BTRFS warning (device dm-93): find_free_extent:5948: Aborting unused transaction(Object already exists).
[  705.759185] ------------[ cut here ]------------
[  705.763929] kernel BUG at fs/btrfs/volumes.c:4891!
[  705.768990] invalid opcode: 0000 [#1] SMP 
[  705.773561] Modules linked in: btrfs zlib_deflate ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 dm_mirror dm_region_hash dm_log dm_round_robin dm_multipath scsi_dh vhost_net macvtap macvlan tun uinput sg joydev sd_mod hid_generic iTCO_wdt iTCO_vendor_support coretemp kvm crc32c_intel ghash_clmulni_intel aesni_intel ablk_helper cryptd lrw aes_x86_64 xts gf128mul microcode serio_raw pcspkr mlx4_ib ib_sa ib_mad ib_core mlx4_en mlx4_core ata_piix libata mpt2sas scsi_transport_sas raid_class scsi_mod cxgb4 i2c_i801 i2c_core button lpc_ich mfd_core ehci_hcd uhci_hcd i7core_edac edac_core dm_mod ioatdma nfsv4 auth_rpcgss nfsv3 nfs_acl nfsv2 nfs lockd sunrpc fscache broadcom tg3 hwmon bnx2 igb dca e1000
[  705.845121] CPU 22 
[  705.847114] Pid: 21317, comm: btrfs-worker-1 Tainted: G        W    3.7.0-00269-gd9acbfd #492 Supermicro X8DTH-i/6/iF/6F/X8DTH
[  705.858886] RIP: 0010:[<ffffffffa05a2f0d>]  [<ffffffffa05a2f0d>] btrfs_map_bio+0x8d/0x300 [btrfs]
[  705.867928] RSP: 0018:ffff880610ce7c58  EFLAGS: 00010296
[  705.873363] RAX: 0000000000000041 RBX: ffff88061c368480 RCX: 0000000000009291
[  705.880692] RDX: 0000000000000091 RSI: 0000000000000001 RDI: ffffffff81a21a40
[  705.888315] RBP: ffff880610ce7d08 R08: 0000000000000001 R09: 0000000000000001
[  705.895805] R10: 00000000000007ca R11: 0000000000000001 R12: 0000000041880000
[  705.903139] R13: 0000000000080000 R14: ffff880c12621468 R15: ffff880c12621458
[  705.910467] FS:  0000000000000000(0000) GS:ffff880c3fd40000(0000) knlGS:0000000000000000
[  705.918978] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  705.925036] CR2: ffffffffff600400 CR3: 0000000001a0b000 CR4: 00000000000007e0
[  705.932406] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  705.939818] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  705.947461] Process btrfs-worker-1 (pid: 21317, threadinfo ffff880610ce6000, task ffff880613b1bec0)
[  705.957264] Stack:
[  705.959806]  ffff8805e0f64000 ffff8808e5b12188 ffff880613b1c578 000004aa11555000
[  705.970044]  ffff880c00000000 ffff880c126214b0 0000000100000000 ffff8805eddd2000
[  705.979630]  0000000000000001 0000000100000411 ffff880610ce7d28 0000000000000246
[  705.989568] Call Trace:
[  705.992386]  [<ffffffffa05a3cf0>] ? run_ordered_completions+0x40/0xd0 [btrfs]
[  706.000651]  [<ffffffffa057bd43>] __btrfs_submit_bio_done+0x23/0x40 [btrfs]
[  706.008210]  [<ffffffffa0570ba1>] run_one_async_done+0xc1/0xd0 [btrfs]
[  706.015049]  [<ffffffffa05a3d33>] run_ordered_completions+0x83/0xd0 [btrfs]
[  706.022246]  [<ffffffffa05a4868>] worker_loop+0x1b8/0x410 [btrfs]
[  706.028930]  [<ffffffffa05a46b0>] ? check_pending_worker_creates+0xe0/0xe0 [btrfs]
[  706.037561]  [<ffffffff81067561>] kthread+0xe1/0xf0
[  706.042896]  [<ffffffff81067480>] ? __init_kthread_worker+0x70/0x70
[  706.049524]  [<ffffffff814b77dc>] ret_from_fork+0x7c/0xb0
[  706.055314]  [<ffffffff81067480>] ? __init_kthread_worker+0x70/0x70
[  706.062429] Code: 56 02 00 00 48 8b 45 c0 48 8b 4d c8 8b 50 28 49 39 cd 89 55 9c 76 1f 4c 89 ea 4c 89 e6 48 c7 c7 e8 a6 5e a0 31 c0 e8 93 84 f0 e0 <0f> 0b 90 eb fe 66 0f 1f 44 00 00 48 89 58 10 48 8b 53 48 48 8b 
[  706.090905] RIP  [<ffffffffa05a2f0d>] btrfs_map_bio+0x8d/0x300 [btrfs]
[  706.098098]  RSP <ffff880610ce7c58>
[  706.102125] ---[ end trace 6e889d6d939ca117 ]---

-- Jim

> 
> Josef
> 
> 


Alex Lyakas Feb. 18, 2014, 3:47 p.m. UTC | #12
Hello Josef,

On Tue, Dec 18, 2012 at 3:52 PM, Josef Bacik <jbacik@fusionio.com> wrote:
> On Wed, Dec 12, 2012 at 06:52:37PM -0700, Liu Bo wrote:
>> An user reported that he has hit an annoying deadlock while playing with
>> ceph based on btrfs.
>>
>> Current updating device tree requires space from METADATA chunk,
>> so we -may- need to do a recursive chunk allocation when adding/updating
>> dev extent, that is where the deadlock comes from.
>>
>> If we use SYSTEM metadata to update device tree, we can avoid the recursive
>> stuff.
>>
>
> This is going to cause us to allocate much more system chunks than we used to
> which could land us in trouble.  Instead let's just keep us from re-entering if
> we're already allocating a chunk.  We do the chunk allocation when we don't have
> enough space for a cluster, but we'll likely have plenty of space to make an
> allocation.  Can you give this patch a try Jim and see if it fixes your problem?
> Thanks,
>
> Josef
>
>
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index e152809..59df5e7 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -3564,6 +3564,10 @@ static int do_chunk_alloc(struct btrfs_trans_handle *trans,
>         int wait_for_alloc = 0;
>         int ret = 0;
>
> +       /* Don't re-enter if we're already allocating a chunk */
> +       if (trans->allocating_chunk)
> +               return -ENOSPC;
> +
>         space_info = __find_space_info(extent_root->fs_info, flags);
>         if (!space_info) {
>                 ret = update_space_info(extent_root->fs_info, flags,
> @@ -3606,6 +3610,8 @@ again:
>                 goto again;
>         }
>
> +       trans->allocating_chunk = true;
> +
>         /*
>          * If we have mixed data/metadata chunks we want to make sure we keep
>          * allocating mixed chunks instead of individual chunks.
> @@ -3632,6 +3638,7 @@ again:
>         check_system_chunk(trans, extent_root, flags);
>
>         ret = btrfs_alloc_chunk(trans, extent_root, flags);
> +       trans->allocating_chunk = false;
>         if (ret < 0 && ret != -ENOSPC)
>                 goto out;
>
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index e6509b9..47ad8be 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -388,6 +388,7 @@ again:
>         h->qgroup_reserved = qgroup_reserved;
>         h->delayed_ref_elem.seq = 0;
>         h->type = type;
> +       h->allocating_chunk = false;
>         INIT_LIST_HEAD(&h->qgroup_ref_list);
>         INIT_LIST_HEAD(&h->new_bgs);
>
> diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
> index 0e8aa1e..69700f7 100644
> --- a/fs/btrfs/transaction.h
> +++ b/fs/btrfs/transaction.h
> @@ -68,6 +68,7 @@ struct btrfs_trans_handle {
>         struct btrfs_block_rsv *orig_rsv;
>         short aborted;
>         short adding_csums;
> +       bool allocating_chunk;
>         enum btrfs_trans_type type;
>         /*
>          * this root is only needed to validate that the root passed to

I hit this problem in the following scenario:
- a data chunk allocation is triggered, and locks chunk_mutex
- the same thread now also wants to allocate a metadata chunk, so it
recursively calls do_chunk_alloc, but cannot lock the chunk_mutex =>
deadlock
- btrfs has only one metadata chunk, the one that was initially
allocated by mkfs, it has:
total_bytes=8388608
bytes_used=8130560
bytes_pinned=77824
bytes_reserved=180224
so bytes_used + bytes_pinned + bytes_reserved == total_bytes
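
The accounting above can be checked with a small userspace sketch. The
struct and helper names below are hypothetical stand-ins, not the kernel's
space_info code; only the four numbers come from the report.

```c
#include <stdint.h>

/* Hypothetical stand-in for the per-space-info counters (not the kernel struct). */
struct space_counters {
	uint64_t total_bytes;
	uint64_t bytes_used;
	uint64_t bytes_pinned;
	uint64_t bytes_reserved;
};

/* Bytes still available once used, pinned and reserved space is subtracted. */
static uint64_t free_bytes(const struct space_counters *s)
{
	uint64_t accounted = s->bytes_used + s->bytes_pinned + s->bytes_reserved;

	return accounted >= s->total_bytes ? 0 : s->total_bytes - accounted;
}

/* Values quoted in the report for the single mkfs-created metadata chunk. */
static const struct space_counters reported = {
	.total_bytes    = 8388608,
	.bytes_used     = 8130560,
	.bytes_pinned   = 77824,
	.bytes_reserved = 180224,
};
```

With these values free_bytes(&reported) is 0, i.e. the metadata chunk has
no room left for another tree block.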

Your patch would have returned ENOSPC and avoided the deadlock, but
there would be a failure to allocate a tree block for metadata. So the
transaction would probably have aborted.
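
A minimal userspace model of that behaviour, assuming a cut-down handle
with only the allocating_chunk flag from the quoted patch. The model_*
names are hypothetical and no real locking or allocation is performed;
it only shows that the nested call gets -ENOSPC instead of blocking.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, cut-down stand-in for struct btrfs_trans_handle. */
struct model_trans {
	bool allocating_chunk;
};

static int model_do_chunk_alloc(struct model_trans *trans, int *nested_ret);

/*
 * Stand-in for btrfs_alloc_chunk(): updating the device tree may need
 * metadata space, which re-enters the chunk allocator on the same handle.
 * The result of that nested attempt is stored in *nested_ret.
 */
static int model_alloc_chunk(struct model_trans *trans, int *nested_ret)
{
	*nested_ret = model_do_chunk_alloc(trans, NULL);
	return 0;
}

/* Pass a non-NULL nested_ret to simulate an allocation that re-enters. */
static int model_do_chunk_alloc(struct model_trans *trans, int *nested_ret)
{
	int ret = 0;

	/* Don't re-enter if we're already allocating a chunk. */
	if (trans->allocating_chunk)
		return -ENOSPC;

	trans->allocating_chunk = true;
	if (nested_ret)
		ret = model_alloc_chunk(trans, nested_ret);
	trans->allocating_chunk = false;
	return ret;
}
```

The outer call succeeds while the nested one reports -ENOSPC, which is
the case where the caller needing a metadata tree block would fail.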

How should such a situation be handled?

Idea1:
- lock chunk mutex,
- if we are allocating a data chunk, check whether the metadata space
is below some threshold. If yes, go and allocate a metadata chunk
first, and only then a data chunk.

Idea2:
- check if we are the same thread that already locked the chunk mutex.
If yes, allow recursive call but don't attempt to lock/unlock the
chunk_mutex this time

Or some other way?
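
Idea1 could be sketched roughly as below. This is only an illustration of
the ordering decision: the enum, the plan struct and the threshold
parameter are all hypothetical, and nothing here is taken from the btrfs
allocator.

```c
#include <stdint.h>

enum model_chunk_type { MODEL_CHUNK_DATA, MODEL_CHUNK_METADATA };

/* Up to two allocations, in the order they should be performed. */
struct model_alloc_plan {
	enum model_chunk_type order[2];
	int count;
};

/*
 * Decide what to allocate while the chunk mutex would be held once:
 * if a data chunk is wanted and metadata space is below the threshold,
 * allocate a metadata chunk first and only then the data chunk.
 */
static struct model_alloc_plan model_plan_alloc(enum model_chunk_type wanted,
						uint64_t metadata_free,
						uint64_t threshold)
{
	struct model_alloc_plan plan = { { wanted, wanted }, 1 };

	if (wanted == MODEL_CHUNK_DATA && metadata_free < threshold) {
		plan.order[0] = MODEL_CHUNK_METADATA;
		plan.order[1] = MODEL_CHUNK_DATA;
		plan.count = 2;
	}
	return plan;
}
```

With metadata_free == 0, as in the scenario above, a data request turns
into "metadata, then data", so no recursive call is needed later.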

Thanks!
Alex.






Josef Bacik Feb. 18, 2014, 4:06 p.m. UTC | #13

On 02/18/2014 10:47 AM, Alex Lyakas wrote:
> Hello Josef,
> 
> On Tue, Dec 18, 2012 at 3:52 PM, Josef Bacik <jbacik@fusionio.com>
> wrote:
>> On Wed, Dec 12, 2012 at 06:52:37PM -0700, Liu Bo wrote:
>>> An user reported that he has hit an annoying deadlock while
>>> playing with ceph based on btrfs.
>>> 
>>> Current updating device tree requires space from METADATA
>>> chunk, so we -may- need to do a recursive chunk allocation when
>>> adding/updating dev extent, that is where the deadlock comes
>>> from.
>>> 
>>> If we use SYSTEM metadata to update device tree, we can avoid
>>> the recursive stuff.
>>> 
>> 
>> This is going to cause us to allocate much more system chunks
>> than we used to which could land us in trouble.  Instead let's
>> just keep us from re-entering if we're already allocating a
>> chunk.  We do the chunk allocation when we don't have enough
>> space for a cluster, but we'll likely have plenty of space to
>> make an allocation.  Can you give this patch a try Jim and see if
>> it fixes your problem? Thanks,
>> 
>> Josef
>> 
>> 
>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>> index e152809..59df5e7 100644
>> --- a/fs/btrfs/extent-tree.c
>> +++ b/fs/btrfs/extent-tree.c
>> @@ -3564,6 +3564,10 @@ static int do_chunk_alloc(struct btrfs_trans_handle *trans,
>>         int wait_for_alloc = 0;
>>         int ret = 0;
>>
>> +       /* Don't re-enter if we're already allocating a chunk */
>> +       if (trans->allocating_chunk)
>> +               return -ENOSPC;
>> +
>>         space_info = __find_space_info(extent_root->fs_info, flags);
>>         if (!space_info) {
>>                 ret = update_space_info(extent_root->fs_info, flags,
>> @@ -3606,6 +3610,8 @@ again:
>>                 goto again;
>>         }
>>
>> +       trans->allocating_chunk = true;
>> +
>>         /*
>>          * If we have mixed data/metadata chunks we want to make sure we keep
>>          * allocating mixed chunks instead of individual chunks.
>> @@ -3632,6 +3638,7 @@ again:
>>         check_system_chunk(trans, extent_root, flags);
>>
>>         ret = btrfs_alloc_chunk(trans, extent_root, flags);
>> +       trans->allocating_chunk = false;
>>         if (ret < 0 && ret != -ENOSPC)
>>                 goto out;
>>
>> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
>> index e6509b9..47ad8be 100644
>> --- a/fs/btrfs/transaction.c
>> +++ b/fs/btrfs/transaction.c
>> @@ -388,6 +388,7 @@ again:
>>         h->qgroup_reserved = qgroup_reserved;
>>         h->delayed_ref_elem.seq = 0;
>>         h->type = type;
>> +       h->allocating_chunk = false;
>>         INIT_LIST_HEAD(&h->qgroup_ref_list);
>>         INIT_LIST_HEAD(&h->new_bgs);
>>
>> diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
>> index 0e8aa1e..69700f7 100644
>> --- a/fs/btrfs/transaction.h
>> +++ b/fs/btrfs/transaction.h
>> @@ -68,6 +68,7 @@ struct btrfs_trans_handle {
>>         struct btrfs_block_rsv *orig_rsv;
>>         short aborted;
>>         short adding_csums;
>> +       bool allocating_chunk;
>>         enum btrfs_trans_type type;
>>         /*
>>          * this root is only needed to validate that the root passed to
> 
> I hit this problem in the following scenario:
> - a data chunk allocation is triggered, and locks chunk_mutex
> - the same thread now also wants to allocate a metadata chunk, so it
> recursively calls do_chunk_alloc, but cannot lock the chunk_mutex =>
> deadlock
> - btrfs has only one metadata chunk, the one that was initially
> allocated by mkfs, it has:
> total_bytes=8388608
> bytes_used=8130560
> bytes_pinned=77824
> bytes_reserved=180224
> so bytes_used + bytes_pinned + bytes_reserved == total_bytes
>
> Your patch would have returned ENOSPC and avoided the deadlock, but
> there would be a failure to allocate a tree block for metadata. So the
> transaction would probably have aborted.
>
> How should such a situation be handled?
>
> Idea1:
> - lock chunk mutex,
> - if we are allocating a data chunk, check whether the metadata space
> is below some threshold. If yes, go and allocate a metadata chunk
> first, and only then a data chunk.
>
> Idea2:
> - check if we are the same thread that already locked the chunk mutex.
> If yes, allow recursive call but don't attempt to lock/unlock the
> chunk_mutex this time
> 
> Or some other way?
> 

I fixed this with the delayed chunk allocation stuff which doesn't
actually do the block group creation stuff until we end the
transaction, so we can allocate metadata chunks without any issue.
Thanks,

Josef
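
The deferral described above can be pictured with a toy model: chunk
allocation only queues the new block group on the transaction handle, and
the tree updates happen when the transaction ends. This is a sketch under
that assumption, not the kernel implementation; every name below is
hypothetical.

```c
#include <stddef.h>

#define MODEL_MAX_PENDING 8

/* Hypothetical transaction handle carrying a queue of new block groups. */
struct model_trans_bgs {
	int pending[MODEL_MAX_PENDING]; /* chunk ids whose block group items are not written yet */
	size_t nr_pending;
	size_t nr_created;              /* block group items actually written */
};

/* Allocating a chunk only queues the block group; no tree is modified here. */
static int model_queue_block_group(struct model_trans_bgs *t, int chunk_id)
{
	if (t->nr_pending == MODEL_MAX_PENDING)
		return -1;
	t->pending[t->nr_pending++] = chunk_id;
	return 0;
}

/* At transaction end the queued block groups are finally created. */
static void model_end_transaction(struct model_trans_bgs *t)
{
	for (size_t i = 0; i < t->nr_pending; i++)
		t->nr_created++; /* stand-in for inserting the block group item */
	t->nr_pending = 0;
}
```

Because nothing is written to a tree during the allocation itself in this
model, the allocation cannot trigger a further chunk allocation.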
Alex Lyakas Feb. 18, 2014, 4:24 p.m. UTC | #14
Hi Josef,
is this the commit to look at:
6df9a95e63395f595d0d1eb5d561dd6c91c40270 Btrfs: make the chunk
allocator completely tree lockless

or are some other commits also relevant?

Alex.


Josef Bacik Feb. 18, 2014, 4:26 p.m. UTC | #15

On 02/18/2014 11:24 AM, Alex Lyakas wrote:
> Hi Josef, is this the commit to look at: 
> 6df9a95e63395f595d0d1eb5d561dd6c91c40270 Btrfs: make the chunk 
> allocator completely tree lockless
> 
> or are some other commits also relevant?
> 

It's been so long but I'm pretty sure everything you need is in that
patch.  Thanks,

Josef

Patch

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index e152809..59df5e7 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3564,6 +3564,10 @@  static int do_chunk_alloc(struct btrfs_trans_handle *trans,
 	int wait_for_alloc = 0;
 	int ret = 0;
 
+	/* Don't re-enter if we're already allocating a chunk */
+	if (trans->allocating_chunk)
+		return -ENOSPC;
+
 	space_info = __find_space_info(extent_root->fs_info, flags);
 	if (!space_info) {
 		ret = update_space_info(extent_root->fs_info, flags,
@@ -3606,6 +3610,8 @@  again:
 		goto again;
 	}
 
+	trans->allocating_chunk = true;
+
 	/*
 	 * If we have mixed data/metadata chunks we want to make sure we keep
 	 * allocating mixed chunks instead of individual chunks.
@@ -3632,6 +3638,7 @@  again:
 	check_system_chunk(trans, extent_root, flags);
 
 	ret = btrfs_alloc_chunk(trans, extent_root, flags);
+	trans->allocating_chunk = false;
 	if (ret < 0 && ret != -ENOSPC)
 		goto out;
 
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index e6509b9..47ad8be 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -388,6 +388,7 @@  again:
 	h->qgroup_reserved = qgroup_reserved;
 	h->delayed_ref_elem.seq = 0;
 	h->type = type;
+	h->allocating_chunk = false;
 	INIT_LIST_HEAD(&h->qgroup_ref_list);
 	INIT_LIST_HEAD(&h->new_bgs);
 
diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
index 0e8aa1e..69700f7 100644
--- a/fs/btrfs/transaction.h
+++ b/fs/btrfs/transaction.h
@@ -68,6 +68,7 @@  struct btrfs_trans_handle {
 	struct btrfs_block_rsv *orig_rsv;
 	short aborted;
 	short adding_csums;
+	bool allocating_chunk;
 	enum btrfs_trans_type type;
 	/*
 	 * this root is only needed to validate that the root passed to