
[3/3] btrfs: volumes: Allocate degraded chunks if rw devices can't fullfil a chunk

Message ID 20191107062710.67964-4-wqu@suse.com (mailing list archive)
State New, archived
Series btrfs: More intelligent degraded chunk allocator

Commit Message

Qu Wenruo Nov. 7, 2019, 6:27 a.m. UTC
[PROBLEM]
A degraded btrfs mount will fall back to the SINGLE profile if there are
not enough devices:

 # mkfs.btrfs -f /dev/test/scratch[12] -m raid1 -d raid1
 # wipefs -fa /dev/test/scratch2
 # mount -o degraded /dev/test/scratch1 /mnt/btrfs
 # fallocate -l 1G /mnt/btrfs/foobar
 # btrfs ins dump-tree -t chunk /dev/test/scratch1
        item 7 key (FIRST_CHUNK_TREE CHUNK_ITEM 1674575872) itemoff 15511 itemsize 80
                length 536870912 owner 2 stripe_len 65536 type DATA
 The new data chunk falls back to SINGLE (its type is plain DATA, not
 DATA|RAID1).

If the user doesn't balance those SINGLE chunks, then even with the
missing device replaced, the fs is no longer fully RAID1, and losing
another device can break the tolerance.

[CAUSE]
The cause is pretty simple: when mounted degraded, missing devices can't
be used for chunk allocation, thus btrfs has to fall back to the SINGLE
profile.

[ENHANCEMENT]
To avoid this problem, this patch will:
- Make all profile reducers/updaters consider missing devices as part
  of num_devices
- Make the chunk allocator fall back to missing_list as a last resort

If we have enough rw_devices, go through the regular chunk allocation
code; this avoids allocating degraded chunks.
E.g. for a 3-device RAID1 degraded mount, we will use the 2 existing
devices to allocate the chunk, avoiding a degraded chunk.

But if we don't have enough rw_devices, we check the missing devices to
allocate degraded chunks.
E.g. for a 2-device RAID1 degraded mount, we have to allocate degraded
chunks to keep the RAID1 profile.
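
To illustrate the decision flow, here is a small userspace model (only a
sketch: it mirrors the shape of the patched __btrfs_alloc_chunk(), not
its real data structures, and the device names/sizes are made up):

#include <stdbool.h>
#include <stdio.h>

struct model_dev {
	const char *name;
	bool missing;            /* device absent on a degraded mount */
	unsigned long long hole; /* largest usable hole, in bytes */
};

/*
 * Pick devices whose "missing" state matches, much like the two
 * gather_dev_holes() calls: first for the rw devices, then for
 * missing_list.
 */
static int gather(const struct model_dev *devs, int total, bool missing,
		  const struct model_dev **picked, int ndevs)
{
	for (int i = 0; i < total; i++)
		if (devs[i].missing == missing && devs[i].hole > 0)
			picked[ndevs++] = &devs[i];
	return ndevs;
}

int main(void)
{
	const struct model_dev devs[] = {
		{ "scratch1", false, 5ULL << 30 },
		{ "scratch2", true,  5ULL << 30 },	/* wiped, degraded */
	};
	const int devs_min = 2;		/* RAID1 needs two stripes */
	const struct model_dev *picked[2];
	int ndevs;

	/* Pass 1: rw devices only, as the existing code does. */
	ndevs = gather(devs, 2, false, picked, 0);

	/* Pass 2 (this patch): missing devices as the last resort. */
	if (ndevs < devs_min)
		ndevs = gather(devs, 2, true, picked, ndevs);

	if (ndevs >= devs_min)
		printf("degraded RAID1 chunk across %d devices\n", ndevs);
	else
		printf("fall back to SINGLE\n");
	return 0;
}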

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/block-group.c | 10 +++++++---
 fs/btrfs/volumes.c     | 18 +++++++++++++++---
 2 files changed, 22 insertions(+), 6 deletions(-)

Comments

Anand Jain Nov. 19, 2019, 10:05 a.m. UTC | #1
On 11/7/19 2:27 PM, Qu Wenruo wrote:
> [PROBLEM]
> Btrfs degraded mount will fallback to SINGLE profile if there are not
> enough devices:

  It's better to keep it like this for now until there is a fix for the
  write hole. Otherwise hitting the write hole bug in the degraded raid1
  case will be more prevalent.

  I proposed an RFC a long time ago [1] (in there, there is also the
  commit id which turned the degraded raid1 profile into the single
  profile, without much write-up on it).

    [1] [PATCH 0/2] [RFC] btrfs: create degraded-RAID1 chunks

  Similarly, the patch related to the reappearing missing device [2]
  falls under the same category. I will push for its integration after
  the write hole fix.

    [2] [PATCH] btrfs: handle dynamically reappearing missing device
    (test case 154).

  If you look closely enough, the original author quite nicely made sure
  the write hole bug is very difficult to hit. These fixes will make it
  easy to hit, so it's better to work on the write hole first.

  I am trying to fix the write hole. The first attempt had limited
  success (it works for two-disk raid1 only); now I am trying other ways
  to fix it.

>   # mkfs.btrfs -f /dev/test/scratch[12] -m raid1 -d raid1
>   # wipefs -fa /dev/test/scratch2
>   # mount -o degraded /dev/test/scratch1 /mnt/btrfs
>   # fallocate -l 1G /mnt/btrfs/foobar
>   # btrfs ins dump-tree -t chunk /dev/test/scratch1
>          item 7 key (FIRST_CHUNK_TREE CHUNK_ITEM 1674575872) itemoff 15511 itemsize 80
>                  length 536870912 owner 2 stripe_len 65536 type DATA
>   New data chunk will fallback to SINGLE.
> 
> If user doesn't balance those SINGLE chunks, even with missing devices
> replaced, the fs is no longer full RAID1, and a missing device can break
> the tolerance.

  As has been discussed quite a few times before, the current
  re-silver/recovery approach for degraded raid1 (with the fallback to
  SINGLE) requires a balance. It's kind of known.

Thanks, Anand


> [CAUSE]
> The cause is pretty simple, when mounted degraded, missing devices can't
> be used for chunk allocation.
> Thus btrfs has to fall back to SINGLE profile.
> 
> [ENHANCEMENT]
> To avoid such problem, this patch will:
> - Make all profiler reducer/updater to consider missing devices as part
>    of num_devices
> - Make chunk allocator to fallback to missing_list as last resort
> 
> If we have enough rw_devices, then go regular chunk allocation code.

> This can avoid allocating degraded chunks.
> E.g. for 3 devices RAID1 degraded mount, we will use the 2 existing
> devices to allocate chunk, avoid degraded chunk.

> But if we don't have enough rw_devices, then we check missing devices to
> allocate degraded chunks.
> E.g. for 2 devices RAID1 degraded mount, we have to allocate degraded
> chunks to keep the RAID1 profile.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>   fs/btrfs/block-group.c | 10 +++++++---
>   fs/btrfs/volumes.c     | 18 +++++++++++++++---
>   2 files changed, 22 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
> index bf7e3f23bba7..1686fd31679b 100644
> --- a/fs/btrfs/block-group.c
> +++ b/fs/btrfs/block-group.c
> @@ -52,11 +52,13 @@ static u64 get_restripe_target(struct btrfs_fs_info *fs_info, u64 flags)
>    */
>   static u64 btrfs_reduce_alloc_profile(struct btrfs_fs_info *fs_info, u64 flags)
>   {
> -	u64 num_devices = fs_info->fs_devices->rw_devices;
> +	u64 num_devices;
>   	u64 target;
>   	u64 raid_type;
>   	u64 allowed = 0;
>   
> +	num_devices = fs_info->fs_devices->rw_devices +
> +		      fs_info->fs_devices->missing_devices;
>   	/*
>   	 * See if restripe for this chunk_type is in progress, if so try to
>   	 * reduce to the target profile
> @@ -1986,7 +1988,8 @@ static u64 update_block_group_flags(struct btrfs_fs_info *fs_info, u64 flags)
>   	if (stripped)
>   		return extended_to_chunk(stripped);
>   
> -	num_devices = fs_info->fs_devices->rw_devices;
> +	num_devices = fs_info->fs_devices->rw_devices +
> +		      fs_info->fs_devices->missing_devices;
>   
>   	stripped = BTRFS_BLOCK_GROUP_RAID0 | BTRFS_BLOCK_GROUP_RAID56_MASK |
>   		BTRFS_BLOCK_GROUP_RAID1_MASK | BTRFS_BLOCK_GROUP_RAID10;
> @@ -2981,7 +2984,8 @@ static u64 get_profile_num_devs(struct btrfs_fs_info *fs_info, u64 type)
>   
>   	num_dev = btrfs_raid_array[btrfs_bg_flags_to_raid_index(type)].devs_max;
>   	if (!num_dev)
> -		num_dev = fs_info->fs_devices->rw_devices;
> +		num_dev = fs_info->fs_devices->rw_devices +
> +			  fs_info->fs_devices->missing_devices;
>   
>   	return num_dev;
>   }
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index a462d8de5d2a..4dee1974ceb7 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -5052,8 +5052,9 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
>   	max_chunk_size = min(div_factor(fs_devices->total_rw_bytes, 1),
>   			     max_chunk_size);
>   
> -	devices_info = kcalloc(fs_devices->rw_devices, sizeof(*devices_info),
> -			       GFP_NOFS);
> +	devices_info = kcalloc(fs_devices->rw_devices +
> +			       fs_devices->missing_devices,
> +			       sizeof(*devices_info), GFP_NOFS);
>   	if (!devices_info)
>   		return -ENOMEM;
>   
> @@ -5067,7 +5068,18 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
>   			max_stripe_size, dev_stripes);
>   	if (ret < 0)
>   		goto error;
> -
> +	/*
> +	 * If rw devices can't fullfil the request, fallback to missing devices
> +	 * as last resort.
> +	 */
> +	if (ndevs < devs_min) {
> +		ret = gather_dev_holes(info, devices_info + ndevs, &ndevs,
> +				&fs_devices->missing_list,
> +				fs_devices->missing_devices,
> +				max_stripe_size, dev_stripes);
> +		if (ret < 0)
> +			goto error;
> +	}
>   	/*
>   	 * now sort the devices by hole size / available space
>   	 */
>
Qu Wenruo Nov. 19, 2019, 10:41 a.m. UTC | #2
On 2019/11/19 at 6:05 PM, Anand Jain wrote:
> On 11/7/19 2:27 PM, Qu Wenruo wrote:
>> [PROBLEM]
>> Btrfs degraded mount will fallback to SINGLE profile if there are not
>> enough devices:
> 
>  Its better to keep it like this for now until there is a fix for the
>  write hole. Otherwise hitting the write hole bug in case of degraded
>  raid1 will be more prevalent.

The write hole should be a RAID5/6 problem, not a problem of the
degraded chunk feature itself.

Furthermore, this design tries to avoid allocating chunks on missing
devices.
So even for a 3-device RAID5, new chunks will be allocated using the
existing devices (a 2-device RAID5), so no new write hole is introduced.

> 
>  I proposed a RFC a long time before [1] (also in there, there
>  is a commit id which turned the degraded raid1 profile into single
>  profile (without much write-up on it)).
> 
>    [1] [PATCH 0/2] [RFC] btrfs: create degraded-RAID1 chunks

My point for this patchset is:
- Create regular chunks if we have enough devices
- Create degraded chunks only when we don't have enough devices

I guess since you didn't get the point of my preparation patches, your
patches aren't as good at avoiding missing devices.

> 
>  Similarly the patch related to the reappearing missing device [2]
>  falls under the same category. Will push for the integration after
>  the write hole fix.
> 
>    [2] [PATCH] btrfs: handle dynamically reappearing missing device
>    (test case 154).

That's another case, and I don't see how it affects this feature.

> 
>  If you look close enough the original author has quite nicely made
>  sure write hole bug will be very difficultly to hit. These fixes
>  shall make it easy to hit. So its better to work on the write hole
>  first.

If you're talking about RAID5/6, you are talking in the wrong thread.
Go implement a write-ahead log for RAID5/6, or mark all degraded
RAID5/6 chunks read-only at mount time.

> 
>  I am trying to fix write hole. First attempt has limited success
>  (works fine in two disk raid1 only). Now trying other ways to fix.
> 
>>   # mkfs.btrfs -f /dev/test/scratch[12] -m raid1 -d raid1
>>   # wipefs -fa /dev/test/scratch2
>>   # mount -o degraded /dev/test/scratch1 /mnt/btrfs
>>   # fallocate -l 1G /mnt/btrfs/foobar
>>   # btrfs ins dump-tree -t chunk /dev/test/scratch1
>>          item 7 key (FIRST_CHUNK_TREE CHUNK_ITEM 1674575872) itemoff
>> 15511 itemsize 80
>>                  length 536870912 owner 2 stripe_len 65536 type DATA
>>   New data chunk will fallback to SINGLE.
>>
>> If user doesn't balance those SINGLE chunks, even with missing devices
>> replaced, the fs is no longer full RAID1, and a missing device can break
>> the tolerance.
> 
>  As its been discussed quite a lot of time before, the current
>  re-silver/recovery approach for degraded-raid1 (with offload to Single)
>  requires balance. Its kind of known.

I'd call such "well-known" behavior BS.

All other raid1 implementations can accept a single-device RAID1 and
re-silver themselves into a full RAID1 setup once more devices are
added.

But for BTRFS you're calling the SINGLE profile fallback "well-known"?
It's "well-known" because it's not working properly; that's why I'm
trying to fix it.


> 
> Thanks, Anand
> 
> 
>> [CAUSE]
>> The cause is pretty simple, when mounted degraded, missing devices can't
>> be used for chunk allocation.
>> Thus btrfs has to fall back to SINGLE profile.
>>
>> [ENHANCEMENT]
>> To avoid such problem, this patch will:
>> - Make all profiler reducer/updater to consider missing devices as part
>>    of num_devices
>> - Make chunk allocator to fallback to missing_list as last resort
>>
>> If we have enough rw_devices, then go regular chunk allocation code.
> 
>> This can avoid allocating degraded chunks.
>> E.g. for 3 devices RAID1 degraded mount, we will use the 2 existing
>> devices to allocate chunk, avoid degraded chunk.
> 
>> But if we don't have enough rw_devices, then we check missing devices to
>> allocate degraded chunks.
>> E.g. for 2 devices RAID1 degraded mount, we have to allocate degraded
>> chunks to keep the RAID1 profile.
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>>   fs/btrfs/block-group.c | 10 +++++++---
>>   fs/btrfs/volumes.c     | 18 +++++++++++++++---
>>   2 files changed, 22 insertions(+), 6 deletions(-)
>>
>> diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
>> index bf7e3f23bba7..1686fd31679b 100644
>> --- a/fs/btrfs/block-group.c
>> +++ b/fs/btrfs/block-group.c
>> @@ -52,11 +52,13 @@ static u64 get_restripe_target(struct
>> btrfs_fs_info *fs_info, u64 flags)
>>    */
>>   static u64 btrfs_reduce_alloc_profile(struct btrfs_fs_info *fs_info,
>> u64 flags)
>>   {
>> -    u64 num_devices = fs_info->fs_devices->rw_devices;
>> +    u64 num_devices;
>>       u64 target;
>>       u64 raid_type;
>>       u64 allowed = 0;
>>   +    num_devices = fs_info->fs_devices->rw_devices +
>> +              fs_info->fs_devices->missing_devices;
>>       /*
>>        * See if restripe for this chunk_type is in progress, if so try to
>>        * reduce to the target profile
>> @@ -1986,7 +1988,8 @@ static u64 update_block_group_flags(struct
>> btrfs_fs_info *fs_info, u64 flags)
>>       if (stripped)
>>           return extended_to_chunk(stripped);
>>   -    num_devices = fs_info->fs_devices->rw_devices;
>> +    num_devices = fs_info->fs_devices->rw_devices +
>> +              fs_info->fs_devices->missing_devices;
>>         stripped = BTRFS_BLOCK_GROUP_RAID0 |
>> BTRFS_BLOCK_GROUP_RAID56_MASK |
>>           BTRFS_BLOCK_GROUP_RAID1_MASK | BTRFS_BLOCK_GROUP_RAID10;
>> @@ -2981,7 +2984,8 @@ static u64 get_profile_num_devs(struct
>> btrfs_fs_info *fs_info, u64 type)
>>         num_dev =
>> btrfs_raid_array[btrfs_bg_flags_to_raid_index(type)].devs_max;
>>       if (!num_dev)
>> -        num_dev = fs_info->fs_devices->rw_devices;
>> +        num_dev = fs_info->fs_devices->rw_devices +
>> +              fs_info->fs_devices->missing_devices;
>>         return num_dev;
>>   }
>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>> index a462d8de5d2a..4dee1974ceb7 100644
>> --- a/fs/btrfs/volumes.c
>> +++ b/fs/btrfs/volumes.c
>> @@ -5052,8 +5052,9 @@ static int __btrfs_alloc_chunk(struct
>> btrfs_trans_handle *trans,
>>       max_chunk_size = min(div_factor(fs_devices->total_rw_bytes, 1),
>>                    max_chunk_size);
>>   -    devices_info = kcalloc(fs_devices->rw_devices,
>> sizeof(*devices_info),
>> -                   GFP_NOFS);
>> +    devices_info = kcalloc(fs_devices->rw_devices +
>> +                   fs_devices->missing_devices,
>> +                   sizeof(*devices_info), GFP_NOFS);
>>       if (!devices_info)
>>           return -ENOMEM;
>>   @@ -5067,7 +5068,18 @@ static int __btrfs_alloc_chunk(struct
>> btrfs_trans_handle *trans,
>>               max_stripe_size, dev_stripes);
>>       if (ret < 0)
>>           goto error;
>> -
>> +    /*
>> +     * If rw devices can't fullfil the request, fallback to missing
>> devices
>> +     * as last resort.
>> +     */
>> +    if (ndevs < devs_min) {
>> +        ret = gather_dev_holes(info, devices_info + ndevs, &ndevs,
>> +                &fs_devices->missing_list,
>> +                fs_devices->missing_devices,
>> +                max_stripe_size, dev_stripes);
>> +        if (ret < 0)
>> +            goto error;
>> +    }
>>       /*
>>        * now sort the devices by hole size / available space
>>        */
>>
>
David Sterba Nov. 27, 2019, 7:23 p.m. UTC | #3
On Tue, Nov 19, 2019 at 06:41:49PM +0800, Qu Wenruo wrote:
> On 2019/11/19 下午6:05, Anand Jain wrote:
> > On 11/7/19 2:27 PM, Qu Wenruo wrote:
> >> [PROBLEM]
> >> Btrfs degraded mount will fallback to SINGLE profile if there are not
> >> enough devices:
> > 
> >  Its better to keep it like this for now until there is a fix for the
> >  write hole. Otherwise hitting the write hole bug in case of degraded
> >  raid1 will be more prevalent.
> 
> Write hole should be a problem for RAID5/6, not the degraded chunk
> feature itself.
> 
> Furthermore, this design will try to avoid allocating chunks using
> missing devices.
> So even for 3 devices RAID5, new chunks will be allocated by using
> existing devices (2 devices RAID5), so no new write hole is introduced.

That this would allow a 2-device raid5 (from the expected 3) is similar
to the reduced chunks, but now hidden, because we don't have a detailed
report of stripes on devices. And a rebalance would be needed to make
sure the filesystem is again on 3 devices (and tolerant to 1 lost
device).

This is different from the 1 missing device for raid1, where a scrub can
fix it (which is expected), but needing a balance IMHO is not.

I'd suggest allowing allocation from missing devices only for the
profiles with redundancy. For now.
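
A minimal userspace sketch of such a gate (assuming the check would key
off the profile's failure tolerance, as btrfs_raid_array encodes it; the
helper name below is made up):

#include <stdbool.h>
#include <stdio.h>

/* Per-profile failure tolerance, mirroring what btrfs_raid_array encodes. */
struct profile {
	const char *name;
	int tolerated_failures;
};

/* Only profiles that survive a lost device may draw on missing devices. */
static bool may_use_missing_devices(const struct profile *p)
{
	return p->tolerated_failures > 0;
}

int main(void)
{
	const struct profile profiles[] = {
		{ "single", 0 }, { "dup",    0 }, { "raid0", 0 },
		{ "raid1",  1 }, { "raid10", 1 },
		{ "raid5",  1 }, { "raid6",  2 },
	};

	for (size_t i = 0; i < sizeof(profiles) / sizeof(profiles[0]); i++)
		printf("%-6s -> degraded fallback %s\n", profiles[i].name,
		       may_use_missing_devices(&profiles[i]) ?
		       "allowed" : "not allowed");
	return 0;
}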
Qu Wenruo Nov. 27, 2019, 11:36 p.m. UTC | #4
On 2019/11/28 at 3:23 AM, David Sterba wrote:
> On Tue, Nov 19, 2019 at 06:41:49PM +0800, Qu Wenruo wrote:
>> On 2019/11/19 下午6:05, Anand Jain wrote:
>>> On 11/7/19 2:27 PM, Qu Wenruo wrote:
>>>> [PROBLEM]
>>>> Btrfs degraded mount will fallback to SINGLE profile if there are not
>>>> enough devices:
>>>
>>>  Its better to keep it like this for now until there is a fix for the
>>>  write hole. Otherwise hitting the write hole bug in case of degraded
>>>  raid1 will be more prevalent.
>>
>> Write hole should be a problem for RAID5/6, not the degraded chunk
>> feature itself.
>>
>> Furthermore, this design will try to avoid allocating chunks using
>> missing devices.
>> So even for 3 devices RAID5, new chunks will be allocated by using
>> existing devices (2 devices RAID5), so no new write hole is introduced.
> 
> That this would allow a 2 device raid5 (from expected 3) is similar to
> the reduced chunks, but now hidden because we don't have a detailed
> report for stripes on devices. And rebalance would be needed to make
> sure that's the filesystem is again 3 devices (and 1 device lost
> tolerant).
> 
> This is different to the 1 device missing for raid1, where scrub can
> fix that (expected), but the balance is IMHO not.
> 
> I'd suggest to allow allocation from missing devices only from the
> profiles with redundancy. For now.

But RAID5 itself supports 2 devices, right?
And even a 2-device RAID5 can still tolerate 1 missing device.

The tolerance hasn't changed in that case, just the disk usage becomes
unbalanced.

Thanks,
Qu
David Sterba Nov. 28, 2019, 11:24 a.m. UTC | #5
On Thu, Nov 28, 2019 at 07:36:41AM +0800, Qu Wenruo wrote:
> On 2019/11/28 上午3:23, David Sterba wrote:
> > On Tue, Nov 19, 2019 at 06:41:49PM +0800, Qu Wenruo wrote:
> >> On 2019/11/19 下午6:05, Anand Jain wrote:
> >>> On 11/7/19 2:27 PM, Qu Wenruo wrote:
> >>>> [PROBLEM]
> >>>> Btrfs degraded mount will fallback to SINGLE profile if there are not
> >>>> enough devices:
> >>>
> >>>  Its better to keep it like this for now until there is a fix for the
> >>>  write hole. Otherwise hitting the write hole bug in case of degraded
> >>>  raid1 will be more prevalent.
> >>
> >> Write hole should be a problem for RAID5/6, not the degraded chunk
> >> feature itself.
> >>
> >> Furthermore, this design will try to avoid allocating chunks using
> >> missing devices.
> >> So even for 3 devices RAID5, new chunks will be allocated by using
> >> existing devices (2 devices RAID5), so no new write hole is introduced.
> > 
> > That this would allow a 2 device raid5 (from expected 3) is similar to
> > the reduced chunks, but now hidden because we don't have a detailed
> > report for stripes on devices. And rebalance would be needed to make
> > sure that's the filesystem is again 3 devices (and 1 device lost
> > tolerant).
> > 
> > This is different to the 1 device missing for raid1, where scrub can
> > fix that (expected), but the balance is IMHO not.
> > 
> > I'd suggest to allow allocation from missing devices only from the
> > profiles with redundancy. For now.
> 
> But RAID5 itself supports 2 devices, right?
> And even 2 devices RAID5 can still tolerant 1 missing device.

> The tolerance hasn't changed in that case, just unbalanced disk usage then.

Ah right, the constraints are still fine. That the usage is unbalanced
is something I'd still consider a problem, because it silently changes
the layout from the one that was set by the user.

As there are two conflicting ways to continue from the missing device state:

- try to use remaining devices to allow writes but change the layout
- don't allow writes, let user/admin sort it out

I'd rather have more time to understand the implications and try to
experiment with that.
Qu Wenruo Nov. 28, 2019, 12:29 p.m. UTC | #6
On 2019/11/28 at 7:24 PM, David Sterba wrote:
> On Thu, Nov 28, 2019 at 07:36:41AM +0800, Qu Wenruo wrote:
>> On 2019/11/28 上午3:23, David Sterba wrote:
>>> On Tue, Nov 19, 2019 at 06:41:49PM +0800, Qu Wenruo wrote:
>>>> On 2019/11/19 下午6:05, Anand Jain wrote:
>>>>> On 11/7/19 2:27 PM, Qu Wenruo wrote:
>>>>>> [PROBLEM]
>>>>>> Btrfs degraded mount will fallback to SINGLE profile if there are not
>>>>>> enough devices:
>>>>>
>>>>>  Its better to keep it like this for now until there is a fix for the
>>>>>  write hole. Otherwise hitting the write hole bug in case of degraded
>>>>>  raid1 will be more prevalent.
>>>>
>>>> Write hole should be a problem for RAID5/6, not the degraded chunk
>>>> feature itself.
>>>>
>>>> Furthermore, this design will try to avoid allocating chunks using
>>>> missing devices.
>>>> So even for 3 devices RAID5, new chunks will be allocated by using
>>>> existing devices (2 devices RAID5), so no new write hole is introduced.
>>>
>>> That this would allow a 2 device raid5 (from expected 3) is similar to
>>> the reduced chunks, but now hidden because we don't have a detailed
>>> report for stripes on devices. And rebalance would be needed to make
>>> sure that's the filesystem is again 3 devices (and 1 device lost
>>> tolerant).
>>>
>>> This is different to the 1 device missing for raid1, where scrub can
>>> fix that (expected), but the balance is IMHO not.
>>>
>>> I'd suggest to allow allocation from missing devices only from the
>>> profiles with redundancy. For now.
>>
>> But RAID5 itself supports 2 devices, right?
>> And even 2 devices RAID5 can still tolerant 1 missing device.
> 
>> The tolerance hasn't changed in that case, just unbalanced disk usage then.
> 
> Ah right, the constraints are still fine. That the usage is unbalanced
> is something I'd still consider a problem because it's silently changing
> the layout from the one that was set by user.

Then it's a trade-off between:
- Keeping exactly the same layout
- Keeping just the tolerance

I'd say that even without missing devices, btrfs by its nature can't
ensure all chunks are allocated using the same device layout.

E.g. a 4-disk raid5 with 1T + 2T + 2T + 2T devices: once the 1T disk is
full, btrfs can only allocate 3-disk RAID5 chunks instead of 4-disk
ones.

> 
> As there are two conflicting ways to continue from the missing device state:
> 
> - try to use remaining devices to allow writes but change the layout
> - don't allow writes, let user/admin sort it out
> 
> I'd rather have more time to understand the implications and try to
> experiment with that.

Sure, no problem.

Thanks,
Qu
Qu Wenruo Nov. 28, 2019, 12:30 p.m. UTC | #7
On 2019/11/28 at 7:24 PM, David Sterba wrote:
> On Thu, Nov 28, 2019 at 07:36:41AM +0800, Qu Wenruo wrote:
>> On 2019/11/28 上午3:23, David Sterba wrote:
>>> On Tue, Nov 19, 2019 at 06:41:49PM +0800, Qu Wenruo wrote:
>>>> On 2019/11/19 下午6:05, Anand Jain wrote:
>>>>> On 11/7/19 2:27 PM, Qu Wenruo wrote:
>>>>>> [PROBLEM]
>>>>>> Btrfs degraded mount will fallback to SINGLE profile if there are not
>>>>>> enough devices:
>>>>>
>>>>>  Its better to keep it like this for now until there is a fix for the
>>>>>  write hole. Otherwise hitting the write hole bug in case of degraded
>>>>>  raid1 will be more prevalent.
>>>>
>>>> Write hole should be a problem for RAID5/6, not the degraded chunk
>>>> feature itself.
>>>>
>>>> Furthermore, this design will try to avoid allocating chunks using
>>>> missing devices.
>>>> So even for 3 devices RAID5, new chunks will be allocated by using
>>>> existing devices (2 devices RAID5), so no new write hole is introduced.
>>>
>>> That this would allow a 2 device raid5 (from expected 3) is similar to
>>> the reduced chunks, but now hidden because we don't have a detailed
>>> report for stripes on devices. And rebalance would be needed to make
>>> sure that's the filesystem is again 3 devices (and 1 device lost
>>> tolerant).
>>>
>>> This is different to the 1 device missing for raid1, where scrub can
>>> fix that (expected), but the balance is IMHO not.
>>>
>>> I'd suggest to allow allocation from missing devices only from the
>>> profiles with redundancy. For now.
>>
>> But RAID5 itself supports 2 devices, right?
>> And even 2 devices RAID5 can still tolerant 1 missing device.
> 
>> The tolerance hasn't changed in that case, just unbalanced disk usage then.
> 
> Ah right, the constraints are still fine. That the usage is unbalanced
> is something I'd still consider a problem because it's silently changing
> the layout from the one that was set by user.
> 
> As there are two conflicting ways to continue from the missing device state:
> 
> - try to use remaining devices to allow writes but change the layout
> - don't allow writes, let user/admin sort it out
> 
> I'd rather have more time to understand the implications and try to
> experiment with that.
> 
Ah, makes sense.

So no need for a new version.

Reviewed-by: Qu Wenruo <wqu@suse.com>

Thanks,
Qu
Qu Wenruo Nov. 28, 2019, 12:39 p.m. UTC | #8
On 2019/11/28 at 8:30 PM, Qu WenRuo wrote:
> 
> 
> On 2019/11/28 下午7:24, David Sterba wrote:
>> On Thu, Nov 28, 2019 at 07:36:41AM +0800, Qu Wenruo wrote:
>>> On 2019/11/28 上午3:23, David Sterba wrote:
>>>> On Tue, Nov 19, 2019 at 06:41:49PM +0800, Qu Wenruo wrote:
>>>>> On 2019/11/19 下午6:05, Anand Jain wrote:
>>>>>> On 11/7/19 2:27 PM, Qu Wenruo wrote:
>>>>>>> [PROBLEM]
>>>>>>> Btrfs degraded mount will fallback to SINGLE profile if there are not
>>>>>>> enough devices:
>>>>>>
>>>>>>  Its better to keep it like this for now until there is a fix for the
>>>>>>  write hole. Otherwise hitting the write hole bug in case of degraded
>>>>>>  raid1 will be more prevalent.
>>>>>
>>>>> Write hole should be a problem for RAID5/6, not the degraded chunk
>>>>> feature itself.
>>>>>
>>>>> Furthermore, this design will try to avoid allocating chunks using
>>>>> missing devices.
>>>>> So even for 3 devices RAID5, new chunks will be allocated by using
>>>>> existing devices (2 devices RAID5), so no new write hole is introduced.
>>>>
>>>> That this would allow a 2 device raid5 (from expected 3) is similar to
>>>> the reduced chunks, but now hidden because we don't have a detailed
>>>> report for stripes on devices. And rebalance would be needed to make
>>>> sure that's the filesystem is again 3 devices (and 1 device lost
>>>> tolerant).
>>>>
>>>> This is different to the 1 device missing for raid1, where scrub can
>>>> fix that (expected), but the balance is IMHO not.
>>>>
>>>> I'd suggest to allow allocation from missing devices only from the
>>>> profiles with redundancy. For now.
>>>
>>> But RAID5 itself supports 2 devices, right?
>>> And even 2 devices RAID5 can still tolerant 1 missing device.
>>
>>> The tolerance hasn't changed in that case, just unbalanced disk usage then.
>>
>> Ah right, the constraints are still fine. That the usage is unbalanced
>> is something I'd still consider a problem because it's silently changing
>> the layout from the one that was set by user.
>>
>> As there are two conflicting ways to continue from the missing device state:
>>
>> - try to use remaining devices to allow writes but change the layout
>> - don't allow writes, let user/admin sort it out
>>
>> I'd rather have more time to understand the implications and try to
>> experiment with that.
>>
> Ah, makes sense.
> 
> So no need for a new version.
> 
> Reviewed-by: Qu Wenruo <wqu@suse.com>
> 
> Thanks,
> Qu
> 
Facepalm, that's for another thread....

Reviewing a patch from myself, WTF....

Patch

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index bf7e3f23bba7..1686fd31679b 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -52,11 +52,13 @@  static u64 get_restripe_target(struct btrfs_fs_info *fs_info, u64 flags)
  */
 static u64 btrfs_reduce_alloc_profile(struct btrfs_fs_info *fs_info, u64 flags)
 {
-	u64 num_devices = fs_info->fs_devices->rw_devices;
+	u64 num_devices;
 	u64 target;
 	u64 raid_type;
 	u64 allowed = 0;
 
+	num_devices = fs_info->fs_devices->rw_devices +
+		      fs_info->fs_devices->missing_devices;
 	/*
 	 * See if restripe for this chunk_type is in progress, if so try to
 	 * reduce to the target profile
@@ -1986,7 +1988,8 @@  static u64 update_block_group_flags(struct btrfs_fs_info *fs_info, u64 flags)
 	if (stripped)
 		return extended_to_chunk(stripped);
 
-	num_devices = fs_info->fs_devices->rw_devices;
+	num_devices = fs_info->fs_devices->rw_devices +
+		      fs_info->fs_devices->missing_devices;
 
 	stripped = BTRFS_BLOCK_GROUP_RAID0 | BTRFS_BLOCK_GROUP_RAID56_MASK |
 		BTRFS_BLOCK_GROUP_RAID1_MASK | BTRFS_BLOCK_GROUP_RAID10;
@@ -2981,7 +2984,8 @@  static u64 get_profile_num_devs(struct btrfs_fs_info *fs_info, u64 type)
 
 	num_dev = btrfs_raid_array[btrfs_bg_flags_to_raid_index(type)].devs_max;
 	if (!num_dev)
-		num_dev = fs_info->fs_devices->rw_devices;
+		num_dev = fs_info->fs_devices->rw_devices +
+			  fs_info->fs_devices->missing_devices;
 
 	return num_dev;
 }
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index a462d8de5d2a..4dee1974ceb7 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5052,8 +5052,9 @@  static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
 	max_chunk_size = min(div_factor(fs_devices->total_rw_bytes, 1),
 			     max_chunk_size);
 
-	devices_info = kcalloc(fs_devices->rw_devices, sizeof(*devices_info),
-			       GFP_NOFS);
+	devices_info = kcalloc(fs_devices->rw_devices +
+			       fs_devices->missing_devices,
+			       sizeof(*devices_info), GFP_NOFS);
 	if (!devices_info)
 		return -ENOMEM;
 
@@ -5067,7 +5068,18 @@  static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
 			max_stripe_size, dev_stripes);
 	if (ret < 0)
 		goto error;
-
+	/*
+	 * If rw devices can't fullfil the request, fallback to missing devices
+	 * as last resort.
+	 */
+	if (ndevs < devs_min) {
+		ret = gather_dev_holes(info, devices_info + ndevs, &ndevs,
+				&fs_devices->missing_list,
+				fs_devices->missing_devices,
+				max_stripe_size, dev_stripes);
+		if (ret < 0)
+			goto error;
+	}
 	/*
 	 * now sort the devices by hole size / available space
 	 */