
btrfs: kill update_block_group_flags

Message ID 20200106165015.18985-1-josef@toxicpanda.com (mailing list archive)
State New, archived
Series btrfs: kill update_block_group_flags

Commit Message

Josef Bacik Jan. 6, 2020, 4:50 p.m. UTC
btrfs/061 has been failing consistently for me recently with a
transaction abort.  We run out of space in the system chunk array, which
means we've allocated way more system chunks than we need.

Chris added this a long time ago for balance as a poor man's restriping.
If you had a single disk, added another disk, and then did a balance,
update_block_group_flags would figure out which RAID level you needed.

Fast forward to today and we have proper restriping support, so we can
explicitly tell the fs that we're trying to change the RAID level.  This
is accomplished through the normal get_alloc_profile path.

Furthermore, this code actually causes btrfs/061 to fail, because we do
things like mkfs -m dup -d single with multiple devices.  That trips
this check

alloc_flags = update_block_group_flags(fs_info, cache->flags);
if (alloc_flags != cache->flags) {
	ret = btrfs_chunk_alloc(trans, alloc_flags, CHUNK_ALLOC_FORCE);

in btrfs_inc_block_group_ro.  Because we're balancing and scrubbing, but
not actually restriping, we keep forcing allocation of RAID1 chunks.
This eventually causes us to run out of system chunk space and the file
system aborts the transaction and flips read-only.
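
To make that concrete, here is a rough trace of what the old helper does
for a dup-metadata filesystem with two rw devices, worked out from the
function being removed below (illustrative values, not a compilable unit):

	/* metadata block group created by mkfs -m dup, 2 rw devices */
	u64 flags = BTRFS_BLOCK_GROUP_METADATA | BTRFS_BLOCK_GROUP_DUP;

	/*
	 * No restripe target is set and num_devices != 1, so we take the
	 * multi-device branch: flags has no RAID bits, only DUP, so the
	 * helper "upgrades" it and returns
	 *
	 *     BTRFS_BLOCK_GROUP_METADATA | BTRFS_BLOCK_GROUP_RAID1
	 *
	 * That never equals cache->flags, so every btrfs_inc_block_group_ro
	 * call during balance/scrub force-allocates another RAID1 chunk.
	 */
	u64 alloc_flags = update_block_group_flags(fs_info, flags);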

We don't need this poor man's restriping any more; simply use the normal
get_alloc_profile helper, which will get the correct alloc_flags and
thus make the right decision for chunk allocation.  This keeps us from
allocating a billion system chunks and falling over.
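
For reference, the get_alloc_profile path resolves the flags roughly like
this (a simplified sketch from memory of the mainline helpers, not verbatim
kernel code); the reduce step still checks the balance restripe target
first, so an explicit convert is still honored:

	/* simplified: the real code also retries under fs_info->profiles_lock */
	static u64 get_alloc_profile(struct btrfs_fs_info *fs_info, u64 orig_flags)
	{
		u64 flags = orig_flags;

		/* fold in the profiles actually in use for this type */
		if (flags & BTRFS_BLOCK_GROUP_DATA)
			flags |= fs_info->avail_data_alloc_bits;
		else if (flags & BTRFS_BLOCK_GROUP_SYSTEM)
			flags |= fs_info->avail_system_alloc_bits;
		else if (flags & BTRFS_BLOCK_GROUP_METADATA)
			flags |= fs_info->avail_metadata_alloc_bits;

		/*
		 * btrfs_reduce_alloc_profile() consults get_restripe_target()
		 * before anything else, so an in-progress convert wins;
		 * otherwise it picks the best profile that the rw device
		 * count and the profiles in 'flags' allow.
		 */
		return btrfs_reduce_alloc_profile(fs_info, flags);
	}

As far as I can tell, for the dup-metadata case above this resolves back to
DUP (no forced allocation), while a balance convert still gets its target
profile.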

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/block-group.c | 52 ++----------------------------------------
 1 file changed, 2 insertions(+), 50 deletions(-)

Comments

Qu Wenruo Jan. 7, 2020, 11:08 a.m. UTC | #1
On 2020/1/7 12:50 AM, Josef Bacik wrote:
> btrfs/061 has been failing consistently for me recently with a
> transaction abort.  We run out of space in the system chunk array, which
> means we've allocated way too many system chunks than we need.

Isn't that caused by scrubbing creating unnecessary system chunks?

IIRC I had a patch to address that problem by simply not allocating
system chunks for scrub.
("btrfs: scrub: Don't check free space before marking a block group RO")

That doesn't address the whole problem, but it should at least reduce
the likelihood of hitting it.


Furthermore, with the newer over-commit behavior for inc_block_group_ro
("btrfs: use btrfs_can_overcommit in inc_block_group_ro"), we won't
really allocate new system chunks anymore if we can over-commit.

With those two patches, I guess we should have solved the problem.
Or did I miss something?

Thanks,
Qu

> 
> Chris added this a long time ago for balance as a poor mans restriping.
> If you had a single disk and then added another disk and then did a
> balance, update_block_group_flags would then figure out which RAID level
> you needed.
> 
> Fast forward to today and we have restriping behavior, so we can
> explicitly tell the fs that we're trying to change the raid level.  This
> is accomplished through the normal get_alloc_profile path.
> 
> Furthermore this code actually causes btrfs/061 to fail, because we do
> things like mkfs -m dup -d single with multiple devices.  This trips
> this check
> 
> alloc_flags = update_block_group_flags(fs_info, cache->flags);
> if (alloc_flags != cache->flags) {
> 	ret = btrfs_chunk_alloc(trans, alloc_flags, CHUNK_ALLOC_FORCE);
> 
> in btrfs_inc_block_group_ro.  Because we're balancing and scrubbing, but
> not actually restriping, we keep forcing chunk allocation of RAID1
> chunks.  This eventually causes us to run out of system space and the
> file system aborts and flips read only.
> 
> We don't need this poor mans restriping any more, simply use the normal
> get_alloc_profile helper, which will get the correct alloc_flags and
> thus make the right decision for chunk allocation.  This keeps us from
> allocating a billion system chunks and falling over.
> 
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
>  fs/btrfs/block-group.c | 52 ++----------------------------------------
>  1 file changed, 2 insertions(+), 50 deletions(-)
> 
> diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
> index c79eccf188c5..0257e6f1efb1 100644
> --- a/fs/btrfs/block-group.c
> +++ b/fs/btrfs/block-group.c
> @@ -1975,54 +1975,6 @@ int btrfs_make_block_group(struct btrfs_trans_handle *trans, u64 bytes_used,
>  	return 0;
>  }
>  
> -static u64 update_block_group_flags(struct btrfs_fs_info *fs_info, u64 flags)
> -{
> -	u64 num_devices;
> -	u64 stripped;
> -
> -	/*
> -	 * if restripe for this chunk_type is on pick target profile and
> -	 * return, otherwise do the usual balance
> -	 */
> -	stripped = get_restripe_target(fs_info, flags);
> -	if (stripped)
> -		return extended_to_chunk(stripped);
> -
> -	num_devices = fs_info->fs_devices->rw_devices;
> -
> -	stripped = BTRFS_BLOCK_GROUP_RAID0 | BTRFS_BLOCK_GROUP_RAID56_MASK |
> -		BTRFS_BLOCK_GROUP_RAID1_MASK | BTRFS_BLOCK_GROUP_RAID10;
> -
> -	if (num_devices == 1) {
> -		stripped |= BTRFS_BLOCK_GROUP_DUP;
> -		stripped = flags & ~stripped;
> -
> -		/* turn raid0 into single device chunks */
> -		if (flags & BTRFS_BLOCK_GROUP_RAID0)
> -			return stripped;
> -
> -		/* turn mirroring into duplication */
> -		if (flags & (BTRFS_BLOCK_GROUP_RAID1_MASK |
> -			     BTRFS_BLOCK_GROUP_RAID10))
> -			return stripped | BTRFS_BLOCK_GROUP_DUP;
> -	} else {
> -		/* they already had raid on here, just return */
> -		if (flags & stripped)
> -			return flags;
> -
> -		stripped |= BTRFS_BLOCK_GROUP_DUP;
> -		stripped = flags & ~stripped;
> -
> -		/* switch duplicated blocks with raid1 */
> -		if (flags & BTRFS_BLOCK_GROUP_DUP)
> -			return stripped | BTRFS_BLOCK_GROUP_RAID1;
> -
> -		/* this is drive concat, leave it alone */
> -	}
> -
> -	return flags;
> -}
> -
>  int btrfs_inc_block_group_ro(struct btrfs_block_group *cache)
>  
>  {
> @@ -2058,7 +2010,7 @@ int btrfs_inc_block_group_ro(struct btrfs_block_group *cache)
>  	 * if we are changing raid levels, try to allocate a corresponding
>  	 * block group with the new raid level.
>  	 */
> -	alloc_flags = update_block_group_flags(fs_info, cache->flags);
> +	alloc_flags = get_alloc_profile(fs_info, cache->flags);
>  	if (alloc_flags != cache->flags) {
>  		ret = btrfs_chunk_alloc(trans, alloc_flags, CHUNK_ALLOC_FORCE);
>  		/*
> @@ -2082,7 +2034,7 @@ int btrfs_inc_block_group_ro(struct btrfs_block_group *cache)
>  	ret = inc_block_group_ro(cache, 0);
>  out:
>  	if (cache->flags & BTRFS_BLOCK_GROUP_SYSTEM) {
> -		alloc_flags = update_block_group_flags(fs_info, cache->flags);
> +		alloc_flags = get_alloc_profile(fs_info, cache->flags);
>  		mutex_lock(&fs_info->chunk_mutex);
>  		check_system_chunk(trans, alloc_flags);
>  		mutex_unlock(&fs_info->chunk_mutex);
>
Josef Bacik Jan. 7, 2020, 3:09 p.m. UTC | #2
On 1/7/20 6:08 AM, Qu Wenruo wrote:
> 
> 
> On 2020/1/7 12:50 AM, Josef Bacik wrote:
>> btrfs/061 has been failing consistently for me recently with a
>> transaction abort.  We run out of space in the system chunk array, which
>> means we've allocated way too many system chunks than we need.
> 
> Isn't that caused by scrubbing creating unnecessary system chunks?
> 
> IIRC I had a patch to address that problem by just simply not allocating
> system chunks for scrub.
> ("btrfs: scrub: Don't check free space before marking a block  group RO")
> 

This addresses the symptoms, not the root cause of the problem.  Your fix is 
valid, because we probably shouldn't be doing that, but we also shouldn't be 
forcing restriping of block groups arbitrarily.

> Although that doesn't address the whole problem, but it should at least
> reduce the possibility.
> 
> 
> Furthermore, with the newer over-commit behavior for inc_block_group_ro
> ("btrfs: use btrfs_can_overcommit in inc_block_group_ro"), we won't
> really allocate new system chunks anymore if we can over-commit.
> 
> With those two patches, I guess we should have solved the problem.
> Or did I miss something?
> 
You are missing that we're getting forced to allocate a system chunk from this

alloc_flags = update_block_group_flags(fs_info, cache->flags);
if (alloc_flags != cache->flags) {
	ret = btrfs_chunk_alloc(trans, alloc_flags, CHUNK_ALLOC_FORCE);

which you move down in your patch but which will still get tripped by rebalance.
So you sort of paper over the real problem; we just don't get bitten by it as
hard with 061 because balance takes longer than scrub does.  If we let it run
longer per filesystem type we'd still hit the same problem.

In short, your patches do make it better, and are definitely correct because we 
probably shouldn't be allocating new chunks for scrub, but they don't address 
the real cause of the problem.  All the patches are needed.  Thanks,

Josef
Qu Wenruo Jan. 8, 2020, 5:36 a.m. UTC | #3
On 2020/1/7 11:09 PM, Josef Bacik wrote:
> On 1/7/20 6:08 AM, Qu Wenruo wrote:
>>
>>
>> On 2020/1/7 12:50 AM, Josef Bacik wrote:
>>> btrfs/061 has been failing consistently for me recently with a
>>> transaction abort.  We run out of space in the system chunk array, which
>>> means we've allocated way too many system chunks than we need.
>>
>> Isn't that caused by scrubbing creating unnecessary system chunks?
>>
>> IIRC I had a patch to address that problem by just simply not allocating
>> system chunks for scrub.
>> ("btrfs: scrub: Don't check free space before marking a block  group RO")
>>
> 
> This addresses the symptoms, not the root cause of the problem.  Your
> fix is valid, because we probably shouldn't be doing that, but we also
> shouldn't be forcing restriping of block groups arbitrarily.
> 
>> Although that doesn't address the whole problem, but it should at least
>> reduce the possibility.
>>
>>
>> Furthermore, with the newer over-commit behavior for inc_block_group_ro
>> ("btrfs: use btrfs_can_overcommit in inc_block_group_ro"), we won't
>> really allocate new system chunks anymore if we can over-commit.
>>
>> With those two patches, I guess we should have solved the problem.
>> Or did I miss something?
>>
> You are missing that we're getting forced to allocate a system chunk
> from this
> 
> alloc_flags = update_block_group_flags(fs_info, cache->flags);
> if (alloc_flags != cache->flags) {
>     ret = btrfs_chunk_alloc(trans, alloc_flags, CHUNK_ALLOC_FORCE);
> 
> which you move down in your patch, but will still get tripped by
> rebalance.  So you sort of paper over the real problem, we just don't
> get bitten by it as hard with 061 because balance takes longer than
> scrub does.  If we let it run longer per fs type we'd still hit the same
> problem.
> 
> In short, your patches do make it better, and are definitely correct
> because we probably shouldn't be allocating new chunks for scrub, but
> they don't address the real cause of the problem.  All the patches are
> needed.  Thanks,

Indeed.

Then the patch looks good to me.

Reviewed-by: Qu Wenruo <wqu@suse.com>

And thanks again for fixing the remaining source of the unnecessary chunk
allocations.

Thanks,
Qu

> 
> Josef
David Sterba Jan. 8, 2020, 5:03 p.m. UTC | #4
On Mon, Jan 06, 2020 at 11:50:15AM -0500, Josef Bacik wrote:
> btrfs/061 has been failing consistently for me recently with a
> transaction abort.  We run out of space in the system chunk array, which
> means we've allocated way too many system chunks than we need.
> 
> Chris added this a long time ago for balance as a poor mans restriping.
> If you had a single disk and then added another disk and then did a
> balance, update_block_group_flags would then figure out which RAID level
> you needed.
> 
> Fast forward to today and we have restriping behavior, so we can
> explicitly tell the fs that we're trying to change the raid level.  This
> is accomplished through the normal get_alloc_profile path.
> 
> Furthermore this code actually causes btrfs/061 to fail, because we do
> things like mkfs -m dup -d single with multiple devices.  This trips
> this check
> 
> alloc_flags = update_block_group_flags(fs_info, cache->flags);
> if (alloc_flags != cache->flags) {
> 	ret = btrfs_chunk_alloc(trans, alloc_flags, CHUNK_ALLOC_FORCE);
> 
> in btrfs_inc_block_group_ro.  Because we're balancing and scrubbing, but
> not actually restriping, we keep forcing chunk allocation of RAID1
> chunks.  This eventually causes us to run out of system space and the
> file system aborts and flips read only.
> 
> We don't need this poor mans restriping any more, simply use the normal
> get_alloc_profile helper, which will get the correct alloc_flags and
> thus make the right decision for chunk allocation.  This keeps us from
> allocating a billion system chunks and falling over.
> 
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
>  fs/btrfs/block-group.c | 52 ++----------------------------------------
>  1 file changed, 2 insertions(+), 50 deletions(-)
> 
> diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
> index c79eccf188c5..0257e6f1efb1 100644
> --- a/fs/btrfs/block-group.c
> +++ b/fs/btrfs/block-group.c
> @@ -1975,54 +1975,6 @@ int btrfs_make_block_group(struct btrfs_trans_handle *trans, u64 bytes_used,
>  	return 0;
>  }
>  
> -static u64 update_block_group_flags(struct btrfs_fs_info *fs_info, u64 flags)
> -{
> -	u64 num_devices;
> -	u64 stripped;
> -
> -	/*
> -	 * if restripe for this chunk_type is on pick target profile and
> -	 * return, otherwise do the usual balance
> -	 */
> -	stripped = get_restripe_target(fs_info, flags);
> -	if (stripped)
> -		return extended_to_chunk(stripped);
> -
> -	num_devices = fs_info->fs_devices->rw_devices;
> -
> -	stripped = BTRFS_BLOCK_GROUP_RAID0 | BTRFS_BLOCK_GROUP_RAID56_MASK |
> -		BTRFS_BLOCK_GROUP_RAID1_MASK | BTRFS_BLOCK_GROUP_RAID10;
> -
> -	if (num_devices == 1) {
> -		stripped |= BTRFS_BLOCK_GROUP_DUP;
> -		stripped = flags & ~stripped;
> -
> -		/* turn raid0 into single device chunks */
> -		if (flags & BTRFS_BLOCK_GROUP_RAID0)
> -			return stripped;
> -
> -		/* turn mirroring into duplication */
> -		if (flags & (BTRFS_BLOCK_GROUP_RAID1_MASK |
> -			     BTRFS_BLOCK_GROUP_RAID10))
> -			return stripped | BTRFS_BLOCK_GROUP_DUP;
> -	} else {
> -		/* they already had raid on here, just return */
> -		if (flags & stripped)
> -			return flags;
> -
> -		stripped |= BTRFS_BLOCK_GROUP_DUP;
> -		stripped = flags & ~stripped;
> -
> -		/* switch duplicated blocks with raid1 */
> -		if (flags & BTRFS_BLOCK_GROUP_DUP)
> -			return stripped | BTRFS_BLOCK_GROUP_RAID1;
> -
> -		/* this is drive concat, leave it alone */
> -	}

I remember that I ended up in that function while testing the raid1c34
feature, converting from one profile to another, but I can't recall the
details now.  I think the fallback to the profiles did occur here in some
cases, so we need to make sure that the same use cases are still
supported.

Patch

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index c79eccf188c5..0257e6f1efb1 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -1975,54 +1975,6 @@  int btrfs_make_block_group(struct btrfs_trans_handle *trans, u64 bytes_used,
 	return 0;
 }
 
-static u64 update_block_group_flags(struct btrfs_fs_info *fs_info, u64 flags)
-{
-	u64 num_devices;
-	u64 stripped;
-
-	/*
-	 * if restripe for this chunk_type is on pick target profile and
-	 * return, otherwise do the usual balance
-	 */
-	stripped = get_restripe_target(fs_info, flags);
-	if (stripped)
-		return extended_to_chunk(stripped);
-
-	num_devices = fs_info->fs_devices->rw_devices;
-
-	stripped = BTRFS_BLOCK_GROUP_RAID0 | BTRFS_BLOCK_GROUP_RAID56_MASK |
-		BTRFS_BLOCK_GROUP_RAID1_MASK | BTRFS_BLOCK_GROUP_RAID10;
-
-	if (num_devices == 1) {
-		stripped |= BTRFS_BLOCK_GROUP_DUP;
-		stripped = flags & ~stripped;
-
-		/* turn raid0 into single device chunks */
-		if (flags & BTRFS_BLOCK_GROUP_RAID0)
-			return stripped;
-
-		/* turn mirroring into duplication */
-		if (flags & (BTRFS_BLOCK_GROUP_RAID1_MASK |
-			     BTRFS_BLOCK_GROUP_RAID10))
-			return stripped | BTRFS_BLOCK_GROUP_DUP;
-	} else {
-		/* they already had raid on here, just return */
-		if (flags & stripped)
-			return flags;
-
-		stripped |= BTRFS_BLOCK_GROUP_DUP;
-		stripped = flags & ~stripped;
-
-		/* switch duplicated blocks with raid1 */
-		if (flags & BTRFS_BLOCK_GROUP_DUP)
-			return stripped | BTRFS_BLOCK_GROUP_RAID1;
-
-		/* this is drive concat, leave it alone */
-	}
-
-	return flags;
-}
-
 int btrfs_inc_block_group_ro(struct btrfs_block_group *cache)
 
 {
@@ -2058,7 +2010,7 @@  int btrfs_inc_block_group_ro(struct btrfs_block_group *cache)
 	 * if we are changing raid levels, try to allocate a corresponding
 	 * block group with the new raid level.
 	 */
-	alloc_flags = update_block_group_flags(fs_info, cache->flags);
+	alloc_flags = get_alloc_profile(fs_info, cache->flags);
 	if (alloc_flags != cache->flags) {
 		ret = btrfs_chunk_alloc(trans, alloc_flags, CHUNK_ALLOC_FORCE);
 		/*
@@ -2082,7 +2034,7 @@  int btrfs_inc_block_group_ro(struct btrfs_block_group *cache)
 	ret = inc_block_group_ro(cache, 0);
 out:
 	if (cache->flags & BTRFS_BLOCK_GROUP_SYSTEM) {
-		alloc_flags = update_block_group_flags(fs_info, cache->flags);
+		alloc_flags = get_alloc_profile(fs_info, cache->flags);
 		mutex_lock(&fs_info->chunk_mutex);
 		check_system_chunk(trans, alloc_flags);
 		mutex_unlock(&fs_info->chunk_mutex);