
[1/2] btrfs: create degraded-RAID1 chunks

Message ID 1461812780-538-2-git-send-email-anand.jain@oracle.com (mailing list archive)
State New, archived

Commit Message

Anand Jain April 28, 2016, 3:06 a.m. UTC
When RAID1 is degraded, newly created chunks should be degraded-RAID1
chunks instead of single chunks.

The bug is that devs_min for RAID1 was wrong; it should
be 1 instead of 2.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
---
 fs/btrfs/volumes.c | 38 +++++++++++++++++++++++++++++++++-----
 1 file changed, 33 insertions(+), 5 deletions(-)

Comments

David Sterba April 29, 2016, 4:42 p.m. UTC | #1
On Thu, Apr 28, 2016 at 11:06:19AM +0800, Anand Jain wrote:
> When RAID1 is degraded, newer chunks should be degraded-RAID1
> chunks instead of single chunks.
> 
> The bug is because the devs_min for raid1 was wrong it should
> be 1, instead of 2.
> 
> Signed-off-by: Anand Jain <anand.jain@oracle.com>
> ---
>  fs/btrfs/volumes.c | 38 +++++++++++++++++++++++++++++++++-----
>  1 file changed, 33 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index e2b54d546b7c..8b87ed6eb381 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -56,7 +56,7 @@ const struct btrfs_raid_attr btrfs_raid_array[BTRFS_NR_RAID_TYPES] = {
>  		.sub_stripes	= 1,
>  		.dev_stripes	= 1,
>  		.devs_max	= 2,
> -		.devs_min	= 2,
> +		.devs_min	= 1,

I think we should introduce another way to determine the lower limit
for degraded mounts. We need the proper raidX constraints and should
use the degraded limits only in the case of a degraded mount.

>  		.tolerated_failures = 1,

Which is exactly the tolerated_failures:

  degraded_devs_min == devs_min - tolerated_failures

which works for all raid levels with redundancy.
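
To make that concrete, here is a minimal sketch of the idea (not part of
the patch; the helper and its degraded flag are made up for illustration,
only the btrfs_raid_attr fields are real):

static int btrfs_degraded_devs_min(const struct btrfs_raid_attr *attr,
				   bool degraded)
{
	int devs_min = attr->devs_min;

	/* degraded_devs_min == devs_min - tolerated_failures */
	if (degraded)
		devs_min -= attr->tolerated_failures;

	/* never go below one device */
	return devs_min < 1 ? 1 : devs_min;
}

So devs_min keeps describing the healthy volume, and the degraded limit
is derived only when the mount is actually degraded.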

>  		.devs_increment	= 2,
>  		.ncopies	= 2,
Anand Jain May 2, 2016, 6:10 a.m. UTC | #2
Thanks for the comments, more below.

On 04/30/2016 12:42 AM, David Sterba wrote:
> On Thu, Apr 28, 2016 at 11:06:19AM +0800, Anand Jain wrote:
>> When RAID1 is degraded, newer chunks should be degraded-RAID1
>> chunks instead of single chunks.
>>
>> The bug is because the devs_min for raid1 was wrong it should
>> be 1, instead of 2.
>>
>> Signed-off-by: Anand Jain <anand.jain@oracle.com>
>> ---
>>   fs/btrfs/volumes.c | 38 +++++++++++++++++++++++++++++++++-----
>>   1 file changed, 33 insertions(+), 5 deletions(-)
>>
>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>> index e2b54d546b7c..8b87ed6eb381 100644
>> --- a/fs/btrfs/volumes.c
>> +++ b/fs/btrfs/volumes.c
>> @@ -56,7 +56,7 @@ const struct btrfs_raid_attr btrfs_raid_array[BTRFS_NR_RAID_TYPES] = {
>>   		.sub_stripes	= 1,
>>   		.dev_stripes	= 1,
>>   		.devs_max	= 2,
>> -		.devs_min	= 2,
>> +		.devs_min	= 1,
>
> I think we should introduce another way how to determine the lower limit
> for the degraded mounts. We need the proper raidX constraints and use
> the degraded limits only if in case of the degraded mount.
>
>>   		.tolerated_failures = 1,
>
> Which is exactly the tolerated_failures:
>
>    degraded_devs_min == devs_min - tolerated_failures

  That is, devs_min is actually healthy_devs_min.

> which works for all raid levels with redundancy.

But not for RAID5 and RAID6.

Here is a (simulation?) tool which gives some ready answers.
I have added devs_min - tolerated_failures to it.

https://github.com/asj/btrfs-raid-cal.git

I see the problem as this:
RAID5/6 devs_min values are in the context of a degraded volume.
RAID1/10 devs_min values are in the context of a healthy volume.

RAID5/6 is correct; we already have devs_max to know the number
of devices in a healthy volume. RAID1's devs_min is wrong, so it
ended up being the same as devs_max.

Does that reading make sense? A small side-by-side restatement follows.
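
To restate the two readings side by side, a small illustration (my
summary of the above, not kernel code; the struct is made up and the
devs_min values are the current btrfs_raid_array ones before this patch):

struct devs_min_reading {
	const char *profile;
	int devs_min;		/* value in btrfs_raid_array today */
	const char *refers_to;	/* which volume state that value describes */
};

static const struct devs_min_reading readings[] = {
	{ "RAID1",  2, "healthy volume (degraded would be 1)" },
	{ "RAID10", 4, "healthy volume (degraded would be 3)" },
	{ "RAID5",  2, "already the degraded lower bound" },
	{ "RAID6",  3, "already the degraded lower bound" },
};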

Thanks, Anand


>>   		.devs_increment	= 2,
>>   		.ncopies	= 2,
Anand Jain May 10, 2016, 11 a.m. UTC | #3
>>> -        .devs_min    = 2,
>>> +        .devs_min    = 1,


>> I think we should introduce another way how to determine the lower limit
>> for the degraded mounts. We need the proper raidX constraints and use
>> the degraded limits only if in case of the degraded mount.

>>>           .tolerated_failures = 1,

>> Which is exactly the tolerated_failures:
>>
>>    degraded_devs_min == devs_min - tolerated_failures

>   that is devs_min is actually healthy_devs_min.

>> which works for all raid levels with redundancy.

> But not for RAID5 and RAID6.

> Here is a (simulation?) tool which gives some ready ans.
> I have added devs_min - tolerated_failures to it.
>
> https://github.com/asj/btrfs-raid-cal.git

I have copied the state table from the above repo and
modified it to add the above equation.


[x1                 = devs_increment * sub_stripes]
[ndevs'             = ndevs - ndevs % devs_increment]
[num_stripes        = ndevs' * dev_stripes]
[data_stripes       = num_stripes / ncopies]
[missing_mirror_dev = x1 - ndevs']
[Y                  = devs_min - tolerated_failures ]


		R10 R1 DUP R0 Sn R5 R6
.sub_stripes	= 2, 1, 1, 1, 1, 1, 1
.dev_stripes	= 1, 1, 2, 1, 1, 1, 1
.devs_max	= 0, 2, 1, 0, 1, 0, 0
.devs_min	= 4, 1, 1, 2, 1, 2, 3
.tolerated_fails= 1, 1, 0, 0, 0, 1, 2
.devs_increment	= 2, 2, 1, 1, 1, 1, 1
.ncopies	= 2, 2, 2, 1, 1, 2, 3
  x1             = 4, 2, 1, 1, 1, 1, 1
  Y              = 3, 0, 1, 2, 1, 1, 1

[ndevs = 9]
ndevs		= 9, 9, 9, 9, 9, 9, 9
ndevs'		= 8, 2, 1, 9, 9, 9, 9
num_stripes	= 8, 2, 2, 9, 1, 9, 9
data_stripes	= 4, 1, 1, 9, 1, 8, 7

[ndevs = tolerated_fails + devs_min]
ndevs		= 5, 2, 1, 2, 1, 3, 5
ndevs'		= 4, 2, 1, 2, 1, 3, 5
num_stripes	= 4, 2, 2, 2, 1, 3, 5
data_stripes	= 2, 1, 1, 2, 1, 1, 1

[ndevs = devs_min]
ndevs		= 3, 1, 1, 2, 1, 2, 3
ndevs'		= 2, 0, 1, 2, 1, 2, 3
num_stripes	= -, -, 2, 2, 1, 2, 3
data_stripes	= -, -, 1, 2, 1, 1, 1


[ndevs = devs_min, with RAID1 patch fix]
ndevs		= 3, 1, 1, 2, 1, 2, 3
ndevs'		= 3, 1, 1, 2, 1, 2, 3
num_stripes	= 3, 1, 2, 2, 1, 2, 3
data_stripes	= ?, 1, 1, 2, 1, 1, 1
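
The bracketed formulas above can be checked with a tiny standalone
calculator; this is only a sketch mirroring the tables (the struct, the
function and the RAID1 column values are taken from this mail, not from
the linked tool or the kernel):

#include <stdio.h>

struct raid_attr {
	int sub_stripes, dev_stripes, devs_max, devs_min;
	int tolerated_failures, devs_increment, ncopies;
};

static void calc(const char *name, const struct raid_attr *a, int ndevs)
{
	int x1 = a->devs_increment * a->sub_stripes;
	int y = a->devs_min - a->tolerated_failures;
	int num_stripes, data_stripes;

	/* round down to the number of usable stripes, then honour devs_max */
	ndevs -= ndevs % a->devs_increment;
	if (a->devs_max && ndevs > a->devs_max)
		ndevs = a->devs_max;

	num_stripes = ndevs * a->dev_stripes;
	data_stripes = num_stripes / a->ncopies;

	printf("%-6s x1=%d Y=%d ndevs'=%d num_stripes=%d data_stripes=%d\n",
	       name, x1, y, ndevs, num_stripes, data_stripes);
}

int main(void)
{
	/* RAID1 column of the table above (devs_min already patched to 1) */
	struct raid_attr raid1 = { 1, 1, 2, 1, 1, 2, 2 };

	calc("RAID1", &raid1, 9);	/* matches the [ndevs = 9] row */
	calc("RAID1", &raid1, 1);	/* degraded: ndevs' rounds down to 0 */
	return 0;
}

The second call reproduces the "-" entries in the [ndevs = devs_min]
column for RAID1: the rounding step drops the single remaining device to
zero, which is exactly what the skip-rounding hunk in the patch avoids.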



 > I am seeing problem as this:
 > RAID5&6 devs_min values are in the context of degraded volume.
 > RAID1&10.. devs_min values are in the context of healthy volume.
 >
 > RAID56 is correct. We already have devs_max to know the number
 > of devices in a healthy volumes. RAID1 is devs_min is wrong so
 > it ended up being same as devs_max.


Thanks, Anand



Patch

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index e2b54d546b7c..8b87ed6eb381 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -56,7 +56,7 @@  const struct btrfs_raid_attr btrfs_raid_array[BTRFS_NR_RAID_TYPES] = {
 		.sub_stripes	= 1,
 		.dev_stripes	= 1,
 		.devs_max	= 2,
-		.devs_min	= 2,
+		.devs_min	= 1,
 		.tolerated_failures = 1,
 		.devs_increment	= 2,
 		.ncopies	= 2,
@@ -4513,6 +4513,7 @@  static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
 	int i;
 	int j;
 	int index;
+	int missing_dev = 0;
 
 	BUG_ON(!alloc_profile_is_valid(type, 0));
 
@@ -4627,14 +4628,35 @@  static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
 	sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
 	     btrfs_cmp_device_info, NULL);
 
-	/* round down to number of usable stripes */
-	ndevs -= ndevs % devs_increment;
 
-	if (ndevs < devs_increment * sub_stripes || ndevs < devs_min) {
+	/*
+	 * For raid1 and raid10 in degraded mode, ndevs (= devs_min) is
+	 * less than (devs_increment * sub_stripes) = x.
+	 * For the rest of the RAID levels, x is 1.
+	 */
+	if (ndevs >= (devs_increment * sub_stripes)) {
+		/* round down to number of usable stripes */
+		ndevs -= ndevs % devs_increment;
+	}
+
+	/*
+	 * Fix devs_min for RAID10
+	 */
+	if (ndevs < devs_min) {
+		/* todo: look for better error reporting */
 		ret = -ENOSPC;
 		goto error;
 	}
 
+	/*
+	 * For RAID1 and RAID10, if there are not enough devices for the
+	 * mirror or stripe, let the chunks be created in degraded mode.
+	 * For the rest of the RAID levels this is fine, as
+	 * (devs_increment * sub_stripes) is 1.
+	 */
+	if (ndevs < (devs_increment * sub_stripes))
+		missing_dev = (devs_increment * sub_stripes) - ndevs;
+
 	if (devs_max && ndevs > devs_max)
 		ndevs = devs_max;
 	/*
@@ -4645,11 +4667,17 @@  static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
 	num_stripes = ndevs * dev_stripes;
 
 	/*
-	 * this will have to be fixed for RAID1 and RAID10 over
+	 * This will have to be fixed for RAID1 and RAID10 over
 	 * more drives
 	 */
 	data_stripes = num_stripes / ncopies;
 
+	if (type & BTRFS_BLOCK_GROUP_RAID1) {
+		if (missing_dev) {
+			/* For RAID1, data_stripes is always 1 */
+			data_stripes = num_stripes;
+		}
+	}
 	if (type & BTRFS_BLOCK_GROUP_RAID5) {
 		raid_stripe_len = find_raid56_stripe_len(ndevs - 1,
 				 btrfs_super_stripesize(info->super_copy));