diff mbox series

btrfs-progs: convert: Workaround delayed ref bug by limiting the size of a transaction

Message ID 20190527051054.533-1-wqu@suse.com (mailing list archive)
State New, archived
Headers show
Series btrfs-progs: convert: Workaround delayed ref bug by limiting the size of a transaction | expand

Commit Message

Qu Wenruo May 27, 2019, 5:10 a.m. UTC
In convert we use trans->block_reserved >= 4096 as a threshold to commit
transaction, where block_reserved is the number of new tree blocks
allocated inside a transaction.

The problem is, we still have a hidden bug in delayed ref implementation
in btrfs-progs, when we have a large enough transaction, delayed ref may
failed to find certain tree blocks in extent tree and cause transaction
abort.

This workaround will workaround it by committing transaction at a much
lower threshold.

The old 4096 means 4096 new tree blocks, when using default (16K)
nodesize, it's 64M, which can contain over 12k inlined data extent or
csum for around 60G, or over 800K file extents.

The new threshold will limit the size of new tree blocks to 2M, aligning
with the chunk preallocator threshold, and reducing the possibility to
hit that delayed ref bug.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 convert/source-ext2.c | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

Comments

David Sterba May 27, 2019, 2:10 p.m. UTC | #1
On Mon, May 27, 2019 at 01:10:54PM +0800, Qu Wenruo wrote:
> In convert we use trans->block_reserved >= 4096 as a threshold to commit
> transaction, where block_reserved is the number of new tree blocks
> allocated inside a transaction.
> 
> The problem is, we still have a hidden bug in delayed ref implementation
> in btrfs-progs, when we have a large enough transaction, delayed ref may
> failed to find certain tree blocks in extent tree and cause transaction
> abort.
> 
> This workaround will workaround it by committing transaction at a much
> lower threshold.
> 
> The old 4096 means 4096 new tree blocks, when using default (16K)
> nodesize, it's 64M, which can contain over 12k inlined data extent or
> csum for around 60G, or over 800K file extents.
> 
> The new threshold will limit the size of new tree blocks to 2M, aligning
> with the chunk preallocator threshold, and reducing the possibility to
> hit that delayed ref bug.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>

Added to devel, thanks.
Qu Wenruo May 27, 2019, 2:15 p.m. UTC | #2
On 2019/5/27 下午10:10, David Sterba wrote:
> On Mon, May 27, 2019 at 01:10:54PM +0800, Qu Wenruo wrote:
>> In convert we use trans->block_reserved >= 4096 as a threshold to commit
>> transaction, where block_reserved is the number of new tree blocks
>> allocated inside a transaction.
>>
>> The problem is, we still have a hidden bug in delayed ref implementation
>> in btrfs-progs, when we have a large enough transaction, delayed ref may
>> failed to find certain tree blocks in extent tree and cause transaction
>> abort.
>>
>> This workaround will workaround it by committing transaction at a much
>> lower threshold.
>>
>> The old 4096 means 4096 new tree blocks, when using default (16K)
>> nodesize, it's 64M, which can contain over 12k inlined data extent or
>> csum for around 60G, or over 800K file extents.
>>
>> The new threshold will limit the size of new tree blocks to 2M, aligning
>> with the chunk preallocator threshold, and reducing the possibility to
>> hit that delayed ref bug.
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
> 
> Added to devel, thanks.
> 
BTW, this is really just a workaround.

The ENOSPC itself should be solved by patches: "btrfs-progs: Fix false
ENOSPC alert by tracking used space correctly" and "[PATCH 0/2]
btrfs-progs: Metadata preallocation enhancement".

The root cause of the delayed ref "unable to find backref" bug is still
unknown, but considering how large the transaction needs to be before
hitting that, this workaround should work.

Thanks,
Qu
Nikolay Borisov May 27, 2019, 2:16 p.m. UTC | #3
On 27.05.19 г. 8:10 ч., Qu Wenruo wrote:
> In convert we use trans->block_reserved >= 4096 as a threshold to commit
> transaction, where block_reserved is the number of new tree blocks
> allocated inside a transaction.
> 
> The problem is, we still have a hidden bug in delayed ref implementation
> in btrfs-progs, when we have a large enough transaction, delayed ref may
> failed to find certain tree blocks in extent tree and cause transaction
> abort.
> 

Wouldn't it make more sense to actually debug that hidden bug? Can it be
reproduced by some synthetic test? Arguably, it should be a lot easier
to debug that in user space with gdb and all that ?

> This workaround will workaround it by committing transaction at a much
> lower threshold.
> 
> The old 4096 means 4096 new tree blocks, when using default (16K)
> nodesize, it's 64M, which can contain over 12k inlined data extent or
> csum for around 60G, or over 800K file extents.
> 
> The new threshold will limit the size of new tree blocks to 2M, aligning
> with the chunk preallocator threshold, and reducing the possibility to
> hit that delayed ref bug.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>  convert/source-ext2.c | 13 ++++++++++++-
>  1 file changed, 12 insertions(+), 1 deletion(-)
> 
> diff --git a/convert/source-ext2.c b/convert/source-ext2.c
> index a136e5652898..63cf71fe10d5 100644
> --- a/convert/source-ext2.c
> +++ b/convert/source-ext2.c
> @@ -829,7 +829,18 @@ static int ext2_copy_inodes(struct btrfs_convert_context *cctx,
>  		pthread_mutex_unlock(&p->mutex);
>  		if (ret)
>  			return ret;
> -		if (trans->blocks_used >= 4096) {
> +		/*
> +		 * blocks_used is the number of new tree blocks allocated in
> +		 * current transaction.
> +		 * Use a small amount of it to workaround a bug where delayed
> +		 * ref may fail to locate tree blocks in extent tree.
> +		 *
> +		 * 2M is the threshold to kick chunk preallocator into work,
> +		 * For default (16K) nodesize it will be 128 tree blocks,
> +		 * large enough to contain over 300 inlined files or
> +		 * around 26k file extents. Which should be good enough.
> +		 */
> +		if (trans->blocks_used >= SZ_2M / root->fs_info->nodesize) {
>  			ret = btrfs_commit_transaction(trans, root);
>  			BUG_ON(ret);
>  			trans = btrfs_start_transaction(root, 1);
>
Qu Wenruo May 27, 2019, 2:19 p.m. UTC | #4
On 2019/5/27 下午10:16, Nikolay Borisov wrote:
> 
> 
> On 27.05.19 г. 8:10 ч., Qu Wenruo wrote:
>> In convert we use trans->block_reserved >= 4096 as a threshold to commit
>> transaction, where block_reserved is the number of new tree blocks
>> allocated inside a transaction.
>>
>> The problem is, we still have a hidden bug in delayed ref implementation
>> in btrfs-progs, when we have a large enough transaction, delayed ref may
>> failed to find certain tree blocks in extent tree and cause transaction
>> abort.
>>
> 
> Wouldn't it make more sense to actually debug that hidden bug?

Definitely, but some problem below makes it pretty tricky to pin down.

> Can it be
> reproduced by some synthetic test?

No tests can reproduce yet.

> Arguably, it should be a lot easier
> to debug that in user space with gdb and all that ?

The reporter is using an ext4 over 1TB used to reproduce this.
Not something we can easily reproduce.

But with that debug patch, we should have some clue soon.

Thanks,
Qu

> 
>> This workaround will workaround it by committing transaction at a much
>> lower threshold.
>>
>> The old 4096 means 4096 new tree blocks, when using default (16K)
>> nodesize, it's 64M, which can contain over 12k inlined data extent or
>> csum for around 60G, or over 800K file extents.
>>
>> The new threshold will limit the size of new tree blocks to 2M, aligning
>> with the chunk preallocator threshold, and reducing the possibility to
>> hit that delayed ref bug.
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>>  convert/source-ext2.c | 13 ++++++++++++-
>>  1 file changed, 12 insertions(+), 1 deletion(-)
>>
>> diff --git a/convert/source-ext2.c b/convert/source-ext2.c
>> index a136e5652898..63cf71fe10d5 100644
>> --- a/convert/source-ext2.c
>> +++ b/convert/source-ext2.c
>> @@ -829,7 +829,18 @@ static int ext2_copy_inodes(struct btrfs_convert_context *cctx,
>>  		pthread_mutex_unlock(&p->mutex);
>>  		if (ret)
>>  			return ret;
>> -		if (trans->blocks_used >= 4096) {
>> +		/*
>> +		 * blocks_used is the number of new tree blocks allocated in
>> +		 * current transaction.
>> +		 * Use a small amount of it to workaround a bug where delayed
>> +		 * ref may fail to locate tree blocks in extent tree.
>> +		 *
>> +		 * 2M is the threshold to kick chunk preallocator into work,
>> +		 * For default (16K) nodesize it will be 128 tree blocks,
>> +		 * large enough to contain over 300 inlined files or
>> +		 * around 26k file extents. Which should be good enough.
>> +		 */
>> +		if (trans->blocks_used >= SZ_2M / root->fs_info->nodesize) {
>>  			ret = btrfs_commit_transaction(trans, root);
>>  			BUG_ON(ret);
>>  			trans = btrfs_start_transaction(root, 1);
>>
diff mbox series

Patch

diff --git a/convert/source-ext2.c b/convert/source-ext2.c
index a136e5652898..63cf71fe10d5 100644
--- a/convert/source-ext2.c
+++ b/convert/source-ext2.c
@@ -829,7 +829,18 @@  static int ext2_copy_inodes(struct btrfs_convert_context *cctx,
 		pthread_mutex_unlock(&p->mutex);
 		if (ret)
 			return ret;
-		if (trans->blocks_used >= 4096) {
+		/*
+		 * blocks_used is the number of new tree blocks allocated in
+		 * current transaction.
+		 * Use a small amount of it to workaround a bug where delayed
+		 * ref may fail to locate tree blocks in extent tree.
+		 *
+		 * 2M is the threshold to kick chunk preallocator into work,
+		 * For default (16K) nodesize it will be 128 tree blocks,
+		 * large enough to contain over 300 inlined files or
+		 * around 26k file extents. Which should be good enough.
+		 */
+		if (trans->blocks_used >= SZ_2M / root->fs_info->nodesize) {
 			ret = btrfs_commit_transaction(trans, root);
 			BUG_ON(ret);
 			trans = btrfs_start_transaction(root, 1);