Message ID: 20210722192955.18709-1-dsterba@suse.com (mailing list archive)
State: New, archived
Series: btrfs: allow degenerate raid0/raid10
On 2021/7/23 3:29 AM, David Sterba wrote:
> The data on raid0 and raid10 are supposed to be spread over multiple
> devices, so the minimum constraints are set to 2 and 4 respectively.
> This is an artificial limit and there's some interest to remove it.

This could be a better way to solve the SINGLE chunks created by a
degraded mount.

> Change this to allow raid0 on one device and raid10 on two devices. This
> works as expected eg. when converting or removing devices.
>
> The only difference is when raid0 on two devices gets one device
> removed. Unpatched would silently create a single profile, while newly
> it would be raid0.
>
> The motivation is to allow to preserve the profile type as long as
> possible for some intermediate state (device removal, conversion).
>
> Unpatched kernel will mount and use the degenerate profiles just fine
> but won't allow any operation that would not satisfy the stricter device
> number constraints, eg. not allowing to go from 3 to 2 devices for
> raid10 or various profile conversions.

My initial thought was that the tree-checker would report errors like
crazy, but no: the check for RAID1 only cares about sub-stripes, while
for RAID0 there is no check on the number of devices.

So a good surprise here.

Another thing about single-device RAID0 and two-device RAID10 is stripe
splitting.

Single-device RAID0 is just SINGLE, while two-device RAID10 is just
RAID1. Thus they need no stripe splitting at all.

But we will still do the stripe calculation, which could slightly reduce
performance. Not a big deal though.

>
> Example output:
>
> # btrfs fi us -T .
> Overall:
>     Device size:                  10.00GiB
>     Device allocated:              1.01GiB
>     Device unallocated:            8.99GiB
>     Device missing:                  0.00B
>     Used:                        200.61MiB
>     Free (estimated):              9.79GiB  (min: 9.79GiB)
>     Free (statfs, df):             9.79GiB
>     Data ratio:                       1.00
>     Metadata ratio:                   1.00
>     Global reserve:                3.25MiB  (used: 0.00B)
>     Multiple profiles:                  no
>
>               Data      Metadata  System
> Id Path       RAID0     single    single   Unallocated
> -- ---------- --------- --------- -------- -----------
>  1 /dev/sda10   1.00GiB   8.00MiB  1.00MiB     8.99GiB
> -- ---------- --------- --------- -------- -----------
>    Total        1.00GiB   8.00MiB  1.00MiB     8.99GiB
>    Used       200.25MiB 352.00KiB 16.00KiB
>
> # btrfs dev us .
> /dev/sda10, ID: 1
>    Device size:            10.00GiB
>    Device slack:              0.00B
>    Data,RAID0/1:            1.00GiB

Can we slightly enhance the output?
RAID0/1 really looks like a new profile now, even though the "1" really
means the number of devices.

>    Metadata,single:         8.00MiB
>    System,single:           1.00MiB
>    Unallocated:             8.99GiB
>
> Note "Data,RAID0/1", with btrfs-progs 5.13+ the number of devices per
> profile is printed.
>
> Signed-off-by: David Sterba <dsterba@suse.com>

Reviewed-by: Qu Wenruo <wqu@suse.com>

Thanks,
Qu

> ---
>  fs/btrfs/volumes.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 86846d6e58d0..ad943357072b 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -38,7 +38,7 @@ const struct btrfs_raid_attr btrfs_raid_array[BTRFS_NR_RAID_TYPES] = {
>          .sub_stripes        = 2,
>          .dev_stripes        = 1,
>          .devs_max           = 0,    /* 0 == as many as possible */
> -        .devs_min           = 4,
> +        .devs_min           = 2,
>          .tolerated_failures = 1,
>          .devs_increment     = 2,
>          .ncopies            = 2,
> @@ -103,7 +103,7 @@ const struct btrfs_raid_attr btrfs_raid_array[BTRFS_NR_RAID_TYPES] = {
>          .sub_stripes        = 1,
>          .dev_stripes        = 1,
>          .devs_max           = 0,
> -        .devs_min           = 2,
> +        .devs_min           = 1,
>          .tolerated_failures = 0,
>          .devs_increment     = 1,
>          .ncopies            = 1,
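The patch only changes two `devs_min` fields, but its effect on which configurations are allowed is easy to model. Below is a minimal stand-alone sketch, not kernel code: the table is reduced to the fields quoted in the diff, with the post-patch values, and `profile_allowed` is an invented helper name for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Reduced stand-in for the kernel's btrfs_raid_attr table, keeping only
 * the fields relevant to the patch. Values are the post-patch minimums. */
struct raid_attr {
	const char *name;
	int devs_min;	/* minimum number of devices to allocate a chunk */
	int devs_max;	/* 0 == as many as possible */
};

static const struct raid_attr raid_table[] = {
	{ "single", 1, 1 },
	{ "raid0",  1, 0 },	/* devs_min was 2 before the patch */
	{ "raid1",  2, 2 },
	{ "raid10", 2, 0 },	/* devs_min was 4 before the patch */
};

/* Can a chunk with this profile be allocated across num_devices devices? */
static int profile_allowed(const char *name, int num_devices)
{
	for (size_t i = 0; i < sizeof(raid_table) / sizeof(raid_table[0]); i++)
		if (strcmp(raid_table[i].name, name) == 0)
			return num_devices >= raid_table[i].devs_min;
	return 0;	/* unknown profile */
}
```

With the old values, `profile_allowed("raid0", 1)` and `profile_allowed("raid10", 2)` would both have been rejected; after the patch they pass, which is exactly the degenerate behavior the commit message describes.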
On Fri, Jul 23, 2021 at 06:51:31AM +0800, Qu Wenruo wrote:
> On 2021/7/23 3:29 AM, David Sterba wrote:
> > The data on raid0 and raid10 are supposed to be spread over multiple
> > devices, so the minimum constraints are set to 2 and 4 respectively.
> > This is an artificial limit and there's some interest to remove it.
>
> This could be a better way to solve the SINGLE chunks created by a
> degraded mount.

Yes, but in this case it's rather a coincidence, because raid0 now
becomes a valid fallback profile; other cases are not affected. There's
also some interest to allow full writes with missing devices (as long as
complete data can be written, not necessarily to all copies). MD-RAID
allows that.

As an example, when we'd allow that, a 2-device raid1 with one device
missing will continue to write to the present device, and once the
missing device reappears, scrub will fill in the missing bits, or device
replace will do a full sync.

> > Change this to allow raid0 on one device and raid10 on two devices. This
> > works as expected eg. when converting or removing devices.
> >
> > [...]
>
> My initial thought was that the tree-checker would report errors like
> crazy, but no: the check for RAID1 only cares about sub-stripes, while
> for RAID0 there is no check on the number of devices.
>
> So a good surprise here.
>
> Another thing about single-device RAID0 and two-device RAID10 is stripe
> splitting.
>
> Single-device RAID0 is just SINGLE, while two-device RAID10 is just
> RAID1. Thus they need no stripe splitting at all.
>
> But we will still do the stripe calculation, which could slightly reduce
> performance. Not a big deal though.

Yeah, effectively raid0 == single and raid10 == raid1. I haven't checked
the overhead of the additional striping logic nor measured the
performance impact, but I don't feel it would be noticeable.

> > Example output:
> >
> > # btrfs fi us -T .
> > [...]
> >
> > # btrfs dev us .
> > /dev/sda10, ID: 1
> >    Device size:            10.00GiB
> >    Device slack:              0.00B
> >    Data,RAID0/1:            1.00GiB
>
> Can we slightly enhance the output?
> RAID0/1 really looks like a new profile now, even though the "1" really
> means the number of devices.

Do you have a concrete suggestion? This format was inspired by a
discussion and suggested by users, so I guess this is what people expect,
and I find it clear. It's also documented in the manual page, so if you
think it's not clear or missing some important information, please let me
know.
On Fri, 23 Jul 2021 16:08:43 +0200, David Sterba <dsterba@suse.cz> wrote:
> > Can we slightly enhance the output?
> > RAID0/1 really looks like a new profile now, even though the "1" really
> > means the number of devices.
>
> Do you have a concrete suggestion? This format was inspired by a
> discussion and suggested by users, so I guess this is what people expect,
> and I find it clear. It's also documented in the manual page, so if you
> think it's not clear or missing some important information, please let
> me know.

It really reads like another RAID level, easily confused with RAID10.

Or that it would flip between RAID0 and RAID1 depending on something.

Maybe something like RAID0d1?
On Fri, Jul 23, 2021 at 10:27:30PM +0500, Roman Mamedov wrote:
> On Fri, 23 Jul 2021 16:08:43 +0200, David Sterba <dsterba@suse.cz> wrote:
> > [...]
>
> It really reads like another RAID level, easily confused with RAID10.
>
> Or that it would flip between RAID0 and RAID1 depending on something.

I think it could be confusing when the number of stripes is also another
raid level, like /1 in this case. From the commit

https://github.com/kdave/btrfs-progs/commit/4693e8226140289dcf8f0932af05895a38152817

  /dev/vdc, ID: 1
     Device size:             1.00GiB
     Device slack:              0.00B
     Data,RAID0/2:          912.62MiB
     Data,RAID0/3:          912.62MiB
     Metadata,RAID1:        102.38MiB
     System,RAID1:            8.00MiB
     Unallocated:             1.00MiB

it's IMHO clear, or at least prompting to read the docs for what it
means.

> Maybe something like RAID0d1?

That looks similar to RAID1c3, which I'd interpret as a new profile as
well. The raid56 profiles also print the stripe count, so I don't know
if eg. RAID5d4 is really an improvement. A 4-device mix of raid56 data
and metadata would look like:

# btrfs dev us .
/dev/sda10, ID: 1
   Device size:            10.00GiB
   Device slack:              0.00B
   Data,RAID5/4:            1.00GiB
   Metadata,RAID6/4:       64.00MiB
   System,RAID6/4:          8.00MiB
   Unallocated:             8.93GiB

/dev/sda11, ID: 2
   Device size:            10.00GiB
   Device slack:              0.00B
   Data,RAID5/4:            1.00GiB
   Metadata,RAID6/4:       64.00MiB
   System,RAID6/4:          8.00MiB
   Unallocated:             8.93GiB

/dev/sda12, ID: 3
   Device size:            10.00GiB
   Device slack:              0.00B
   Data,RAID5/4:            1.00GiB
   Metadata,RAID6/4:       64.00MiB
   System,RAID6/4:          8.00MiB
   Unallocated:             8.93GiB

/dev/sda13, ID: 4
   Device size:            10.00GiB
   Device slack:              0.00B
   Data,RAID5/4:            1.00GiB
   Metadata,RAID6/4:       64.00MiB
   System,RAID6/4:          8.00MiB
   Unallocated:             8.93GiB

Maybe it's still too new, so nobody is used to it, and we've always had
problems with the raid naming scheme anyway.
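The `PROFILE/ndevs` notation being debated is purely presentational. A tiny sketch shows where the `/4` suffix comes from; the helper name and signature are invented here for illustration and are not the actual btrfs-progs code.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper mirroring the "<type>,<profile>/<devices>" keys in
 * the `btrfs dev us` output above. Profiles reported without a device
 * count (e.g. "Metadata,single") simply omit the suffix. */
static void format_usage_key(char *buf, size_t len, const char *type,
			     const char *profile, int num_devices)
{
	if (num_devices > 0)
		snprintf(buf, len, "%s,%s/%d", type, profile, num_devices);
	else
		snprintf(buf, len, "%s,%s", type, profile);
}
```

So "Data,RAID5/4" is just the data chunk's profile name plus the count of devices that chunk spans, which is exactly why it can be misread as a single fused profile name like RAID10.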
On 2021/7/23 10:08 PM, David Sterba wrote:
> On Fri, Jul 23, 2021 at 06:51:31AM +0800, Qu Wenruo wrote:
>> [...]
>
> Yes, but in this case it's rather a coincidence, because raid0 now
> becomes a valid fallback profile; other cases are not affected. There's
> also some interest to allow full writes with missing devices (as long as
> complete data can be written, not necessarily to all copies). MD-RAID
> allows that.
>
> As an example, when we'd allow that, a 2-device raid1 with one device
> missing will continue to write to the present device, and once the
> missing device reappears, scrub will fill in the missing bits, or device
> replace will do a full sync.
>
>> [...]
>
> Yeah, effectively raid0 == single and raid10 == raid1. I haven't checked
> the overhead of the additional striping logic nor measured the
> performance impact, but I don't feel it would be noticeable.
>
>>> Example output:
>>>
>>> # btrfs fi us -T .
>>> [...]
>>>
>>> # btrfs dev us .
>>> /dev/sda10, ID: 1
>>>    Device size:            10.00GiB
>>>    Device slack:              0.00B
>>>    Data,RAID0/1:            1.00GiB
>>
>> Can we slightly enhance the output?
>> RAID0/1 really looks like a new profile now, even though the "1" really
>> means the number of devices.
>
> Do you have a concrete suggestion? This format was inspired by a
> discussion and suggested by users, so I guess this is what people expect,
> and I find it clear. It's also documented in the manual page, so if you
> think it's not clear or missing some important information, please let
> me know.

My idea may not be pretty though:

    Data,RAID0 (1 dev):      1.00GiB

And if we follow the existing pattern, it can be more confusing, like:

    Data,RAID5/6

Thanks,
Qu
David Sterba wrote:
> Maybe it's still too new, so nobody is used to it, and we've always had
> problems with the raid naming scheme anyway.

Perhaps slightly off topic, but I constantly see that people do not
understand how the BTRFS "RAID" implementation works. They tend to
confuse it with regular RAID and get angry because they run into
"issues" simply because they do not understand the differences.

I have been an enthusiastic BTRFS user for years, and I actually caught
myself incorrectly explaining how regular RAID works to a guy a while
ago. This happened simply because my mind was so used to how BTRFS uses
this terminology that I did not think about it.

As BTRFS is getting used more and more, it may be increasingly difficult
(if not impossible) to get rid of the "RAID" terminology, but in my
opinion it is increasingly important as well.

Some years ago (2018) there was some talk about a new naming scheme:
https://marc.info/?l=linux-btrfs&m=136286324417767

While technically spot on, I found Hugo's naming scheme difficult. It
was based on this idea: numCOPIESnumSTRIPESnumPARITY

1CmS1P = RAID5, or 1 copy, max stripes, 1 parity.

I also do not agree with the use of 'copy'. The Oxford dictionary
defines 'copy' as "a thing that is made to be the same as something
else, especially a document or a work of art".

And while some might argue that copying something to disk from memory
makes it a copy, it ceases to be a copy once the memory contents are
erased. I therefore think that 'replicas' is a far better terminology.

I earlier suggested Rnum.Snum.Pnum as a naming scheme, which I think is
far more readable, so if I may dare to be as bold....

SINGLE  = R0.S0.P0 (no replicas, no stripes (any device), no parity)
DUP     = R1.S1.P0 (1 replica, 1 stripe (one device), no parity)
RAID0   = R0.Sm.P0 (no replicas, max stripes, no parity)
RAID1   = R1.S0.P0 (1 replica, no stripes (any device), no parity)
RAID1c2 = R2.S0.P0 (2 replicas, no stripes (any device), no parity)
RAID1c3 = R3.S0.P0 (3 replicas, no stripes (any device), no parity)
RAID10  = R1.Sm.P0 (1 replica, max stripes, no parity)
RAID5   = R0.Sm.P1 (no replicas, max stripes, 1 parity)
RAID6   = R0.Sm.P2 (no replicas, max stripes, 2 parity)

This (or Hugo's) type of naming scheme would also easily allow adding
more exotic configurations, such as S5, e.g. striping over 5 devices in
a 10-device storage system, which could increase throughput for certain
workloads (because it leaves half the storage devices "free" for other
jobs). A variant of RAID5 behaving like RAID10 would simply be R1.Sm.P1.
Easy peasy... And just for the record, the old RAID terminology should
of course work for compatibility reasons, but ideally should not be
advertised at all.

Sorry for completely derailing the topic, but I felt it was important to
bring up (and I admit to be overenthusiastic about it). :)
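One nice property of the Rnum.Snum.Pnum proposal above is that the mapping from a (replicas, stripes, parity) triple to a name is purely mechanical. A hypothetical encoder sketch, with the struct and function invented here only to illustrate that point (using -1 as a stand-in for the "m" max-stripes placeholder):

```c
#include <stdio.h>
#include <stddef.h>

/* Geometry triple for the proposed naming scheme. */
struct profile_geom {
	int replicas;
	int stripes;	/* -1 == "m", stripe across as many devices as possible */
	int parity;
};

/* Render the triple as "R<n>.S<n|m>.P<n>". */
static void encode_profile(char *buf, size_t len, struct profile_geom g)
{
	if (g.stripes < 0)
		snprintf(buf, len, "R%d.Sm.P%d", g.replicas, g.parity);
	else
		snprintf(buf, len, "R%d.S%d.P%d", g.replicas, g.stripes,
			 g.parity);
}
```

For example, `{0, -1, 1}` encodes to "R0.Sm.P1" (RAID5 in the table above) and `{1, 0, 0}` to "R1.S0.P0" (RAID1), so a tool could derive the name directly from chunk geometry rather than from a fixed list of profile names.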
On Sat, Jul 24, 2021 at 01:04:19PM +0200, waxhead wrote:
> David Sterba wrote:
> > Maybe it's still too new, so nobody is used to it, and we've always had
> > problems with the raid naming scheme anyway.
>
> [...]
>
> I earlier suggested Rnum.Snum.Pnum as a naming scheme, which I think is
> far more readable, so if I may dare to be as bold....
>
> SINGLE  = R0.S0.P0 (no replicas, no stripes (any device), no parity)
> DUP     = R1.S1.P0 (1 replica, 1 stripe (one device), no parity)
> RAID0   = R0.Sm.P0 (no replicas, max stripes, no parity)
> RAID1   = R1.S0.P0 (1 replica, no stripes (any device), no parity)
> RAID1c2 = R2.S0.P0 (2 replicas, no stripes (any device), no parity)
> RAID1c3 = R3.S0.P0 (3 replicas, no stripes (any device), no parity)
> RAID10  = R1.Sm.P0 (1 replica, max stripes, no parity)
> RAID5   = R0.Sm.P1 (no replicas, max stripes, 1 parity)
> RAID6   = R0.Sm.P2 (no replicas, max stripes, 2 parity)
>
> This (or Hugo's) type of naming scheme would also easily allow adding
> more exotic configurations, such as S5, e.g. striping over 5 devices in
> a 10-device storage system [...]

I'd go along with that scheme, with one minor modification -- make the
leading letters lower-case. The choice of lower-case letters in my
original scheme was deliberate, as it breaks up the sequence and makes
it much easier to pick out the most important parts (the numbers) from
the mere positional markers (the letters).

Also, the "M" (caps, because it's equivalent to the large numbers) in
stripes wasn't for "max", but simply the conventional mathematical "m"
-- some number acting as a limit to a counter (as in, "we have n copies
with m stripes and p parity stripes").

Hugo.
Hugo Mills wrote:
> On Sat, Jul 24, 2021 at 01:04:19PM +0200, waxhead wrote:
>> [...]
>
> I'd go along with that scheme, with one minor modification -- make
> the leading letters lower-case. The choice of lower-case letters in my
> original scheme was deliberate, as it breaks up the sequence and makes
> it much easier to pick out the most important parts (the numbers) from
> the mere positional markers (the letters).
>
> Also, the "M" (caps, because it's equivalent to the large numbers)
> in stripes wasn't for "max", but simply the conventional mathematical
> "m" -- some number acting as a limit to a counter (as in, "we have n
> copies with m stripes and p parity stripes").
>
> Hugo.

Agree. Lowercase r0.s0.p0 / r1.sM.p2 is more readable indeed.

I insist on the dots as separators, as this would make possible future
fantasy things, such as min-max ranges, e.g. r2-4.sM.p0, more readable.

(In my fantasy world, r2-6 would mean 6 replicas, of which all but 2 can
automatically be deleted if the filesystem runs low on space. That would
make parallel reads potentially super fast as long as there is plenty of
free space on the filesystem, plus increase redundancy. Free space is
wasted space, just like with memory, so it might as well be used for
something useful.)
On Sat, Jul 24, 2021 at 01:49:30PM +0200, waxhead wrote:
> Hugo Mills wrote:
> > [...]
>
> Agree. Lowercase r0.s0.p0 / r1.sM.p2 is more readable indeed.
> I insist on the dots as separators, as this would make possible future
> fantasy things, such as min-max ranges, e.g. r2-4.sM.p0, more readable.
>
> (In my fantasy world, r2-6 would mean 6 replicas, of which all but 2
> can automatically be deleted if the filesystem runs low on space. [...])

I like the dots.
Ranges, I'm thinking, would be of particular use with stripes -- there
have been discussions in the past about limiting the stripe width on
large numbers of devices, so that you don't end up with a RAID-6 run
across all 24 devices of an array for every stripe. That might be a
use-case for, for example, r1.s3-8.p2.

Hugo.
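The ranged form stays machine-friendly too. A hypothetical parser sketch for specs like "r1.s3-8.p2" (invented here purely for illustration; the "sM"/"sm" max-stripes spelling and replica ranges are deliberately not handled):

```c
#include <stdio.h>

/* Parsed geometry with an optional min-max stripe range. */
struct geom_range {
	int replicas;
	int stripes_min;
	int stripes_max;
	int parity;
};

/* Accepts "r<n>.s<min>-<max>.p<n>" or "r<n>.s<n>.p<n>".
 * Returns 0 on success, -1 if the spec doesn't match either form. */
static int parse_profile(const char *spec, struct geom_range *out)
{
	/* Try the ranged form first; sscanf fails at the '-' otherwise. */
	if (sscanf(spec, "r%d.s%d-%d.p%d", &out->replicas, &out->stripes_min,
		   &out->stripes_max, &out->parity) == 4)
		return 0;
	if (sscanf(spec, "r%d.s%d.p%d", &out->replicas, &out->stripes_min,
		   &out->parity) == 3) {
		out->stripes_max = out->stripes_min;
		return 0;
	}
	return -1;
}
```

So "r1.s3-8.p2" parses to one replica, 3 to 8 stripes, and 2 parity stripes, while a plain "r0.s0.p0" collapses the range to a single value.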
Hugo Mills wrote:
> On Sat, Jul 24, 2021 at 01:49:30PM +0200, waxhead wrote:
>> [...]
>
> I like the dots.
>
> Ranges, I'm thinking, would be of particular use with stripes --
> there have been discussions in the past about limiting the stripe
> width on large numbers of devices, so that you don't end up with a
> RAID-6 run across all 24 devices of an array for every stripe. That
> might be a use-case for, for example, r1.s3-8.p2.
>
> Hugo.

Not to mention that stripe ranges, for example, could potentially let
parts of the filesystem survive what could otherwise be a fatal failure.
And if my wet dream of storage profiles per subvolume ever becomes
possible, the possibilities are endless....
On Sat, Jul 24, 2021 at 01:04:19PM +0200, waxhead wrote: > David Sterba wrote: > > > Maybe it's still too new so nobody is used to it and we've always had > > problems with the raid naming scheme anyway. > > Perhaps slightly off topic , but I see constantly that people do not > understand how BTRFS "RAID" implementation works. They tend to confuse it > with regular RAID and get angry because they run into "issues" simply > because they do not understand the differences. > > I have been an enthusiastic BTRFS user for years, and I actually caught > myself incorrectly explaining how regular RAID works to a guy a while ago. > This happened simply because my mind was so used to how BTRFS uses this > terminology that I did not think about it. > > As BTRFS is getting used more and more it may be increasingly difficult (if > not impossible) to get rid of the "RAID" terminology, but in my opinion also > increasingly more important as well. > > Some years ago (2018) there was some talk about a new naming scheme > https://marc.info/?l=linux-btrfs&m=136286324417767 > > While technically spot on I found Hugo's naming scheme difficult. It was > based on this idea: > numCOPIESnumSTRIPESnumPARITY > > 1CmS1P = Raid5 or 1 copy, max stripes, 1 parity. > > I also do not agree with the use of 'copy'. The Oxford dictionary defines > 'copy' as "a thing that is made to be the same as something else, especially > a document or a work of art" > > And while some might argue that copying something on disk from memory makes > it a copy, it ceases to be a copy once the memory contents is erased. I > therefore think that replicas is a far better terminology. > > I earlier suggested Rnum.Snum.Pnum as a naming scheme which I think is far > more readable so if I may dare to be as bold.... 
>
> SINGLE  = R0.S0.P0 (no replicas, no stripes (any device), no parity)
> DUP     = R1.S1.P0 (1 replica, 1 stripe (one device), no parity)
> RAID0   = R0.Sm.P0 (no replicas, max stripes, no parity)
> RAID1   = R1.S0.P0 (1 replica, no stripes (any device), no parity)
> RAID1c2 = R2.S0.P0 (2 replicas, no stripes (any device), no parity)
> RAID1c3 = R3.S0.P0 (3 replicas, no stripes (any device), no parity)
> RAID10  = R1.Sm.P0 (1 replica, max stripes, no parity)
> RAID5   = R0.Sm.P1 (no replicas, max stripes, 1 parity)
> RAID6   = R0.Sm.P2 (no replicas, max stripes, 2 parity)

   Sorry, I missed a detail here that someone pointed out on IRC.

   "r0" makes no sense to me, as that says there's no data. I would
argue strongly to add one to all of your r values. (Note that "RAID1c2"
isn't one of the current btrfs RAID levels; by extension from the
others, it's equivalent to the current RAID1, and we have RAID1c4,
which is four complete instances of any item of data.)

   My proposal counted *instances* of the data, not the redundancy.

   Hugo.
On 2021-07-24 14:30, Hugo Mills wrote: > On Sat, Jul 24, 2021 at 01:04:19PM +0200, waxhead wrote: >> David Sterba wrote: >> >>> Maybe it's still too new so nobody is used to it and we've always had >>> problems with the raid naming scheme anyway. >> >> Perhaps slightly off topic , but I see constantly that people do not >> understand how BTRFS "RAID" implementation works. They tend to confuse it >> with regular RAID and get angry because they run into "issues" simply >> because they do not understand the differences. >> >> I have been an enthusiastic BTRFS user for years, and I actually caught >> myself incorrectly explaining how regular RAID works to a guy a while ago. >> This happened simply because my mind was so used to how BTRFS uses this >> terminology that I did not think about it. >> >> As BTRFS is getting used more and more it may be increasingly difficult (if >> not impossible) to get rid of the "RAID" terminology, but in my opinion also >> increasingly more important as well. >> >> Some years ago (2018) there was some talk about a new naming scheme >> https://marc.info/?l=linux-btrfs&m=136286324417767 >> >> While technically spot on I found Hugo's naming scheme difficult. It was >> based on this idea: >> numCOPIESnumSTRIPESnumPARITY >> >> 1CmS1P = Raid5 or 1 copy, max stripes, 1 parity. >> >> I also do not agree with the use of 'copy'. The Oxford dictionary defines >> 'copy' as "a thing that is made to be the same as something else, especially >> a document or a work of art" >> >> And while some might argue that copying something on disk from memory makes >> it a copy, it ceases to be a copy once the memory contents is erased. I >> therefore think that replicas is a far better terminology. >> >> I earlier suggested Rnum.Snum.Pnum as a naming scheme which I think is far >> more readable so if I may dare to be as bold.... 
>>
>> SINGLE  = R0.S0.P0 (no replicas, no stripes (any device), no parity)
>> DUP     = R1.S1.P0 (1 replica, 1 stripe (one device), no parity)
>> RAID0   = R0.Sm.P0 (no replicas, max stripes, no parity)
>> RAID1   = R1.S0.P0 (1 replica, no stripes (any device), no parity)
>> RAID1c2 = R2.S0.P0 (2 replicas, no stripes (any device), no parity)
>> RAID1c3 = R3.S0.P0 (3 replicas, no stripes (any device), no parity)
>> RAID10  = R1.Sm.P0 (1 replica, max stripes, no parity)
>> RAID5   = R0.Sm.P1 (no replicas, max stripes, 1 parity)
>> RAID6   = R0.Sm.P2 (no replicas, max stripes, 2 parity)
>
> Sorry, I missed a detail here that someone pointed out on IRC.
>
> "r0" makes no sense to me, as that says there's no data. I would
> argue strongly to add one to all of your r values. (Note that
> "RAID1c2" isn't one of the current btrfs RAID levels, and by extension
> from the others, it's equivalent to the current RAID1, and we have
> RAID1c4 which is four complete instances of any item of data).
>
> My proposal counted *instances* of the data, not the redundancy.
>
> Hugo.

I think Hugo is right that the terminology of "instance"[1] is easier
to understand than copies or replicas.

Example:
"single"  would be 1 instance
"dup"     would be 2 instances
"raid1"   would be 2 instances, 1 stripe, 0 parity
"raid1c3" would be 3 instances, 1 stripe, 0 parity
"raid1c4" would be 4 instances, 1 stripe, 0 parity
... and so on.

Shortened, we could then use i<num>.s<num>.p<num> for Instances,
Stripes and Parities.

Do we need a specific term for the level of "redundancy"? In the
current suggestions we can have redundancy either because of parity or
because of multiple instances.
Perhaps the output of btrfs-progs could mention the redundancy level,
such as this:

# btrfs fi us /mnt/btrfs/ -T
Overall:
    Device size:                  18.18TiB
    Device allocated:             11.24TiB
    Device unallocated:            6.93TiB
    Device missing:                  0.00B
    Used:                         11.21TiB
    Free (estimated):              6.97TiB  (min: 3.50TiB)
    Free (statfs, df):             6.97TiB
    Data ratio:                       1.00
    Metadata ratio:                   2.00
    Global reserve:              512.00MiB  (used: 0.00B)
    Multiple profiles:                  no

              Data      Metadata  System
Mode:         i1,s0,p0  i2,s0,p0  i2,s0,p0
Redundancy:   0         1         1
------------  --------  --------  --------  -----------
Id Path                                     Unallocated
-- ---------  --------  --------  --------  -----------
 3 /dev/sdb2   5.61TiB  17.00GiB  32.00MiB      3.47TiB
 4 /dev/sdd2   5.60TiB  17.00GiB  32.00MiB      3.47TiB
-- ---------  --------  --------  --------  -----------
   Total      11.21TiB  17.00GiB  32.00MiB      6.94TiB
   Used       11.18TiB  15.76GiB   1.31MiB

Thanks
~Forza

[1] https://www.techopedia.com/definition/16325/instance
On Sat, Jul 24, 2021 at 03:54:18PM +0200, Forza wrote: > > > On 2021-07-24 14:30, Hugo Mills wrote: > > On Sat, Jul 24, 2021 at 01:04:19PM +0200, waxhead wrote: > > > David Sterba wrote: > > > > > > > Maybe it's still too new so nobody is used to it and we've always had > > > > problems with the raid naming scheme anyway. > > > > > > Perhaps slightly off topic , but I see constantly that people do not > > > understand how BTRFS "RAID" implementation works. They tend to confuse it > > > with regular RAID and get angry because they run into "issues" simply > > > because they do not understand the differences. > > > > > > I have been an enthusiastic BTRFS user for years, and I actually caught > > > myself incorrectly explaining how regular RAID works to a guy a while ago. > > > This happened simply because my mind was so used to how BTRFS uses this > > > terminology that I did not think about it. > > > > > > As BTRFS is getting used more and more it may be increasingly difficult (if > > > not impossible) to get rid of the "RAID" terminology, but in my opinion also > > > increasingly more important as well. > > > > > > Some years ago (2018) there was some talk about a new naming scheme > > > https://marc.info/?l=linux-btrfs&m=136286324417767 > > > > > > While technically spot on I found Hugo's naming scheme difficult. It was > > > based on this idea: > > > numCOPIESnumSTRIPESnumPARITY > > > > > > 1CmS1P = Raid5 or 1 copy, max stripes, 1 parity. > > > > > > I also do not agree with the use of 'copy'. The Oxford dictionary defines > > > 'copy' as "a thing that is made to be the same as something else, especially > > > a document or a work of art" > > > And while some might argue that copying something on disk from memory makes > > > it a copy, it ceases to be a copy once the memory contents is erased. That last sentence doesn't make sense. A copy does not cease to be a copy because the original (or some upstream copy) was destroyed. 
I think the usage of "copy" is fine, and also better than the
alternatives proposed so far. "copy" is shorter than "mirror",
"replica", or "instance", and "copy" or its abbreviation "c" already
appears in various btrfs documents, code, and symbolic names. Most
users on IRC understand "raid1 stores 2 copies of the data" as an
explanation, more successfully than "2 mirrors" or "2 replicas", which
usually require clarification. As far as I can tell, "instances" has
never been used to describe btrfs profile-level data replication
before today.

"Instance" does have the advantage of being less ambiguous. The other
words imply the existence of one additional extra thing, an "original"
that has been copied, mirrored, or replicated, which causes off-by-one
errors when people try to understand whether "two mirrors" without
context means 2 or 3 identical instances.

The disadvantage of "instance" is that it abbreviates to "i", which
causes problems when people write "I1" in a sans-serif font. We could
use "n" instead, though people like to write "array of N disks".
Probably all of the letters are bad in some way. I'll use "n" for now,
as none of the non-instance words use that letter.

> > > I therefore think that replicas is a far better terminology.
> > > I earlier suggested Rnum.Snum.Pnum as a naming scheme which I think is far
> > > more readable so if I may dare to be as bold....

Right now the btrfs profile names are all snowflakes: there is a short
string name, and we have to look up in a table what behavior to expect
when that name is invoked. This matches the implementation, which uses
a single bit for all of the above names, and indeed does use a lookup
table to get the parameters (though the code also uses far too many
'if' statements as well).
Here's the list of names adjusted according to comments so far, plus
some corrections (lowercase axis name, 1-based instance count,
capitalize 'M', fix the number of stripes, and use correct btrfs
profile names):

single  = n1.s1.p0 (1 instance, 1 stripe, no parity)
dup     = n2.s1.p0 (2 instances, 1 stripe, no parity)
raid0   = n1.sM.p0 (1 instance, (1..avail_disks) stripes, no parity)
raid1   = n2.s1.p0 (2 instances, 1 stripe, no parity)
raid1c3 = n3.s1.p0 (3 instances, 1 stripe, no parity)
raid1c4 = n4.s1.p0 (4 instances, 1 stripe, no parity)
raid10  = n2.sM.p0 (2 instances, 2..floor(avail_disks / num_instances) stripes, no parity)
raid5   = n1.sM.p1 (1 instance, 1..(avail_disks - parity) stripes, 1 parity)
raid6   = n1.sM.p2 (1 instance, 1..(avail_disks - parity) stripes, 2 parity)

There are three problems:

1. Orthogonal naming conventions make it look like we should be able
to derive parameters by analyzing the names (e.g. RAID5 on RAID1
mirror pairs is "n2.sM.p1", RAID1 of two RAID5 groups is...also
"n2.sM.p1"...OK, there are 3 problems), but btrfs implements exactly
the 9 profiles above and nothing else. We can only use the subset of
orthogonal names that exist in the implementation.

2. The orthogonal-looking dimensions are not fully independent. Is
"n2.sM.p1" RAID5 of RAID1 mirror pairs, or RAID1 of 2x RAID5 parity
groups? Arguably that's not so bad since neither exists yet--we could
say it must be the first one, since that's closer to what RAID10 does,
so it would basically be RAID10 with an extra (pair of) parity devices.

3. Future profiles and even some of the existing profiles do not fit
in the orthogonal dimensions of this naming convention. All the
existing btrfs profiles except dup can be derived from a formula with
these 3 parameters. 2 copies implies min_devs == 2, so "n2.s1.p0" is
raid1, not dup. There's no way to specify dup with the 3 parameters
given.
That might not be a problem--we could say "nope, 'dup' does not fit
the pattern, so we keep calling that profile 'dup'. It's a special
non-orthogonal case, and users must read the manual to see exactly
what it does." Note that all forms of overloading the "s" parameter
are equivalent to an exceptional name for dup or a new orthogonal
parameter, because there's no integer we can plug into RAID layout
formulas to get the btrfs dup profile behavior.

RAID-Z from ZFS or the proposed extent-tree-v2 stripe tree would be
named "n1.sM.p1", the same name we have for the existing btrfs raid5
profile, despite having completely different parity mechanics. We'd
need another dimension in the profile name to describe how the parity
blocks are managed. We can't write something like "pZ" or "pM" because
we still need the number of parity blocks. If we write something like
"pz2" or "pm1" (for raid-z-style 2-parity or stripe-tree 1-parity
respectively) we are still sending the user to the manual to
understand what the extra letter means, and we're not much further
ahead.

The window for defining these names seems to have closed about 13
years ago. Even if we agreed the orthogonal names are better, we'd
have to make "n2.s1.p0" a non-canonical alias of the canonical "raid1"
name so we don't instantly break over a decade of existing tooling,
documentation, and user training. We'd still have to explain over and
over to new users that "raid1" is "n2.s1.p0" in btrfs, not "nM.s1.p0"
as it is in mdadm raid1.

TL;DR I think this is somewhere between "not a good idea" and "a bad
idea" for naming storage profiles; however, there are some places
where we can still use parts of the proposal.
If the proposal includes _implementing_ an orthogonally parameterized
umbrella raid profile, then the above objection doesn't apply--but to
fit into btrfs with its single-bit-per-profile on-disk schema, all of
the orthogonally named raid profiles would be parameters for one new
btrfs "parameterized raid" profile. We'd still need to have all the
old profiles with their old names to be compatible with old on-disk
filesystems, even if they were internally implemented by translating
to the parameterized implementation.

What I do like about this proposal is that it is useful as a
_reporting_ format, and the reporting tools need help with that. Right
now we have "RAID10" or "RAID10/6" for "RAID10 over 6 disks". We could
have tools report "RAID/n2.s3.p0" instead, assuming we fixed all the
quirky parts of btrfs where the specific profile bit matters. If we
don't fix the quirks, then we still need the full btrfs profile bit
name reported, because we need to be able to distinguish between
"single/n1.s1.p0" and "RAID0/n1.s1.p0" in a few corner cases.

"dup" would still be "dup", though it might have a suffix for the
number of distinct devices that are used for the instances (e.g.
"dup/d2" for instances on 2 devices, vs "dup/d1" for both instances on
the same device). The suffix doesn't have to fit into the orthogonal
naming convention because the profile can select the parameter space.

If we implement a new btrfs raid profile that doesn't fit the pattern,
we can use the new profile's name as a prefix, e.g. on 5 disks,
RAID-Z1 "raidz/n1.s4.p1" is distinguishable from btrfs RAID5
"raid5/n1.s4.p1", so it's clear how many copies, parity, stripes, and
disks there are, and which parity mechanism is used.

> > Sorry, I missed a detail here that someone pointed out on IRC.
> >
> > "r0" makes no sense to me, as that says there's no data. I would
> > argue strongly to add one to all of your r values.
> > (Note that
> > "RAID1c2" isn't one of the current btrfs RAID levels, and by extension
> > from the others, it's equivalent to the current RAID1, and we have
> > RAID1c4 which is four complete instances of any item of data).
> >
> > My proposal counted *instances* of the data, not the redundancy.
> >
> > Hugo.
>
> I think Hugo is right that the terminology of "instance"[1] is easier
> to understand than copies or replicas.
>
> Example:
> "single"  would be 1 instance
> "dup"     would be 2 instances
> "raid1"   would be 2 instances, 1 stripe, 0 parity
> "raid1c3" would be 3 instances, 1 stripe, 0 parity
> "raid1c4" would be 4 instances, 1 stripe, 0 parity
> ... and so on.
>
> Shortened we could then use i<num>.s<num>.p<num> for Instances,
> Stripes and Parities.
>
> Do we need a specific term for level of "redundancy"? In the current
> suggestions we can have redundancy either because of parity or of
> multiple instances. Perhaps the output of btrfs-progs could mention
> redundancy level such as this:
>
> # btrfs fi us /mnt/btrfs/ -T
> Overall:
>     Device size:                  18.18TiB
>     Device allocated:             11.24TiB
>     Device unallocated:            6.93TiB
>     Device missing:                  0.00B
>     Used:                         11.21TiB
>     Free (estimated):              6.97TiB  (min: 3.50TiB)
>     Free (statfs, df):             6.97TiB
>     Data ratio:                       1.00
>     Metadata ratio:                   2.00
>     Global reserve:              512.00MiB  (used: 0.00B)
>     Multiple profiles:                  no
>
>               Data      Metadata  System
> Mode:         i1,s0,p0  i2,s0,p0  i2,s0,p0
> Redundancy:   0         1         1
> ------------  --------  --------  --------  -----------
> Id Path                                     Unallocated
> -- ---------  --------  --------  --------  -----------
>  3 /dev/sdb2   5.61TiB  17.00GiB  32.00MiB      3.47TiB
>  4 /dev/sdd2   5.60TiB  17.00GiB  32.00MiB      3.47TiB
> -- ---------  --------  --------  --------  -----------
>    Total      11.21TiB  17.00GiB  32.00MiB      6.94TiB
>    Used       11.18TiB  15.76GiB   1.31MiB
>
> Thanks
> ~Forza
>
> [1] https://www.techopedia.com/definition/16325/instance
On Fri, Jul 23, 2021 at 06:51:31AM +0800, Qu Wenruo wrote:
> On 2021/7/23 3:29 AM, David Sterba wrote:
> > The data on raid0 and raid10 are supposed to be spread over multiple
> > devices, so the minimum constraints are set to 2 and 4 respectively.
> > This is an artificial limit and there's some interest to remove it.
>
> This could be a better way to solve the SINGLE chunk created by degraded
> mount.
>
> > Change this to allow raid0 on one device and raid10 on two devices. This
> > works as expected eg. when converting or removing devices.
> >
> > The only difference is when raid0 on two devices gets one device
> > removed. Unpatched would silently create a single profile, while newly
> > it would be raid0.
> >
> > The motivation is to allow to preserve the profile type as long as it
> > possible for some intermediate state (device removal, conversion).
> >
> > Unpatched kernel will mount and use the degenerate profiles just fine
> > but won't allow any operation that would not satisfy the stricter device
> > number constraints, eg. not allowing to go from 3 to 2 devices for
> > raid10 or various profile conversions.
>
> My initial thought is, tree-checker will report errors like crazy, but
> no, the check for RAID1 only cares about sub_stripes, while for RAID0
> there is no number-of-devices check.
>
> So a good surprise here.
>
> Another thing about single-device RAID0 or 2-device RAID10 is the
> stripe splitting.
>
> Single-device RAID0 is just SINGLE, while 2-device RAID10 is just RAID1.
> Thus they need no stripe splitting at all.
>
> But we will still do the stripe calculation, thus it could slightly
> reduce the performance.
> Not a big deal though.

It might have a more significant impact on an SSD cache below btrfs.
lvmcache (dm-cache) uses the IO request size to decide whether to
bypass the cache or not. Scrubbing a striped-profile array through
lvmcache is a disaster.
All the IOs are carved up into chunks that are smaller than the cache
bypass threshold, so lvmcache tries to cache all the scrub IO and will
flood the SSD with enough writes to burn out a cheap SSD in a month or
two.

Scrubbing raid1 or single uses bigger IO requests, so lvmcache bypasses
the SSD cache and there's no problem (other than incomplete scrub
coverage because of the caching, but that's a whole separate issue).

The kernel allows this already and it's a reasonably well-known gotcha
(I didn't discover it myself, I heard/read it somewhere else and
confirmed it) so there is no urgency, but some day either lvmcache
should get smarter about merging adjacent short reads, or btrfs should
be smarter about switching to "real" single profile behavior when it
sees a striped profile with 1 stripe.

> > Example output:
> >
> >   # btrfs fi us -T .
> >   Overall:
> >       Device size:                  10.00GiB
> >       Device allocated:              1.01GiB
> >       Device unallocated:            8.99GiB
> >       Device missing:                  0.00B
> >       Used:                        200.61MiB
> >       Free (estimated):              9.79GiB  (min: 9.79GiB)
> >       Free (statfs, df):             9.79GiB
> >       Data ratio:                       1.00
> >       Metadata ratio:                   1.00
> >       Global reserve:                3.25MiB  (used: 0.00B)
> >       Multiple profiles:                  no
> >
> >                  Data       Metadata   System
> >   Id Path        RAID0      single     single    Unallocated
> >   -- ----------  ---------  ---------  --------  -----------
> >    1 /dev/sda10    1.00GiB    8.00MiB   1.00MiB      8.99GiB
> >   -- ----------  ---------  ---------  --------  -----------
> >      Total         1.00GiB    8.00MiB   1.00MiB      8.99GiB
> >      Used        200.25MiB  352.00KiB  16.00KiB
> >
> >   # btrfs dev us .
> >   /dev/sda10, ID: 1
> >      Device size:            10.00GiB
> >      Device slack:              0.00B
> >      Data,RAID0/1:            1.00GiB
>
> Can we slightly enhance the output?
> RAID0/1 really looks like a new profile now, even though the "1" really
> means the number of devices.
>
> >      Metadata,single:         8.00MiB
> >      System,single:           1.00MiB
> >      Unallocated:             8.99GiB
> >
> > Note "Data,RAID0/1", with btrfs-progs 5.13+ the number of devices per
> > profile is printed.
> >
> > Signed-off-by: David Sterba <dsterba@suse.com>
>
> Reviewed-by: Qu Wenruo <wqu@suse.com>
>
> Thanks,
> Qu
>
> > ---
> >  fs/btrfs/volumes.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> > index 86846d6e58d0..ad943357072b 100644
> > --- a/fs/btrfs/volumes.c
> > +++ b/fs/btrfs/volumes.c
> > @@ -38,7 +38,7 @@ const struct btrfs_raid_attr btrfs_raid_array[BTRFS_NR_RAID_TYPES] = {
> >  		.sub_stripes	= 2,
> >  		.dev_stripes	= 1,
> >  		.devs_max	= 0,	/* 0 == as many as possible */
> > -		.devs_min	= 4,
> > +		.devs_min	= 2,
> >  		.tolerated_failures = 1,
> >  		.devs_increment	= 2,
> >  		.ncopies	= 2,
> > @@ -103,7 +103,7 @@ const struct btrfs_raid_attr btrfs_raid_array[BTRFS_NR_RAID_TYPES] = {
> >  		.sub_stripes	= 1,
> >  		.dev_stripes	= 1,
> >  		.devs_max	= 0,
> > -		.devs_min	= 2,
> > +		.devs_min	= 1,
> >  		.tolerated_failures = 0,
> >  		.devs_increment	= 1,
> >  		.ncopies	= 1,
Hugo Mills wrote: > On Sat, Jul 24, 2021 at 01:04:19PM +0200, waxhead wrote: >> David Sterba wrote: >> >>> Maybe it's still too new so nobody is used to it and we've always had >>> problems with the raid naming scheme anyway. >> >> Perhaps slightly off topic , but I see constantly that people do not >> understand how BTRFS "RAID" implementation works. They tend to confuse it >> with regular RAID and get angry because they run into "issues" simply >> because they do not understand the differences. >> >> I have been an enthusiastic BTRFS user for years, and I actually caught >> myself incorrectly explaining how regular RAID works to a guy a while ago. >> This happened simply because my mind was so used to how BTRFS uses this >> terminology that I did not think about it. >> >> As BTRFS is getting used more and more it may be increasingly difficult (if >> not impossible) to get rid of the "RAID" terminology, but in my opinion also >> increasingly more important as well. >> >> Some years ago (2018) there was some talk about a new naming scheme >> https://marc.info/?l=linux-btrfs&m=136286324417767 >> >> While technically spot on I found Hugo's naming scheme difficult. It was >> based on this idea: >> numCOPIESnumSTRIPESnumPARITY >> >> 1CmS1P = Raid5 or 1 copy, max stripes, 1 parity. >> >> I also do not agree with the use of 'copy'. The Oxford dictionary defines >> 'copy' as "a thing that is made to be the same as something else, especially >> a document or a work of art" >> >> And while some might argue that copying something on disk from memory makes >> it a copy, it ceases to be a copy once the memory contents is erased. I >> therefore think that replicas is a far better terminology. >> >> I earlier suggested Rnum.Snum.Pnum as a naming scheme which I think is far >> more readable so if I may dare to be as bold.... 
>>
>> SINGLE  = R0.S0.P0 (no replicas, no stripes (any device), no parity)
>> DUP     = R1.S1.P0 (1 replica, 1 stripe (one device), no parity)
>> RAID0   = R0.Sm.P0 (no replicas, max stripes, no parity)
>> RAID1   = R1.S0.P0 (1 replica, no stripes (any device), no parity)
>> RAID1c2 = R2.S0.P0 (2 replicas, no stripes (any device), no parity)
>> RAID1c3 = R3.S0.P0 (3 replicas, no stripes (any device), no parity)
>> RAID10  = R1.Sm.P0 (1 replica, max stripes, no parity)
>> RAID5   = R0.Sm.P1 (no replicas, max stripes, 1 parity)
>> RAID6   = R0.Sm.P2 (no replicas, max stripes, 2 parity)
>
> Sorry, I missed a detail here that someone pointed out on IRC.
>
> "r0" makes no sense to me, as that says there's no data. I would
> argue strongly to add one to all of your r values. (Note that

I disagree. R0 means no replicas of stored data. All filesystems have
stored data, so adding one seems rather pointless to me. R0 would
easily be interpreted as "you can lose NO device", and Rn would
likewise easily be interpreted as "you can lose n device(s)".

> "RAID1c2" isn't one of the current btrfs RAID levels, and by extension
> from the others, it's equivalent to the current RAID1, and we have
> RAID1c4 which is four complete instances of any item of data).
>
> My proposal counted *instances* of the data, not the redundancy.
>
> Hugo.

You are correct that RAID1c2 does not exist, my mistake -- it should
have been RAID1c3, which is r2.s0.p0, i.e. 3 instances of data.

I am not directly against counting instances of the data, but since
all filesystems have at least one instance to be meaningful, I
personally prefer replicas, as this more clearly indicates how many
devices you can lose (for instances you have to subtract one anyway).
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 86846d6e58d0..ad943357072b 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -38,7 +38,7 @@ const struct btrfs_raid_attr btrfs_raid_array[BTRFS_NR_RAID_TYPES] = {
 		.sub_stripes	= 2,
 		.dev_stripes	= 1,
 		.devs_max	= 0,	/* 0 == as many as possible */
-		.devs_min	= 4,
+		.devs_min	= 2,
 		.tolerated_failures = 1,
 		.devs_increment	= 2,
 		.ncopies	= 2,
@@ -103,7 +103,7 @@ const struct btrfs_raid_attr btrfs_raid_array[BTRFS_NR_RAID_TYPES] = {
 		.sub_stripes	= 1,
 		.dev_stripes	= 1,
 		.devs_max	= 0,
-		.devs_min	= 2,
+		.devs_min	= 1,
 		.tolerated_failures = 0,
 		.devs_increment	= 1,
 		.ncopies	= 1,
The data on raid0 and raid10 are supposed to be spread over multiple
devices, so the minimum constraints are set to 2 and 4 respectively.
This is an artificial limit and there's some interest to remove it.

Change this to allow raid0 on one device and raid10 on two devices.
This works as expected, e.g. when converting or removing devices.

The only difference is when raid0 on two devices gets one device
removed. An unpatched kernel would silently create a single profile,
while newly it would remain raid0.

The motivation is to preserve the profile type for as long as possible
through intermediate states (device removal, conversion).

An unpatched kernel will mount and use the degenerate profiles just
fine, but won't allow any operation that would not satisfy the
stricter device number constraints, e.g. not allowing to go from 3 to
2 devices for raid10, or various profile conversions.

Example output:

  # btrfs fi us -T .
  Overall:
      Device size:                  10.00GiB
      Device allocated:              1.01GiB
      Device unallocated:            8.99GiB
      Device missing:                  0.00B
      Used:                        200.61MiB
      Free (estimated):              9.79GiB  (min: 9.79GiB)
      Free (statfs, df):             9.79GiB
      Data ratio:                       1.00
      Metadata ratio:                   1.00
      Global reserve:                3.25MiB  (used: 0.00B)
      Multiple profiles:                  no

                 Data       Metadata   System
  Id Path        RAID0      single     single    Unallocated
  -- ----------  ---------  ---------  --------  -----------
   1 /dev/sda10    1.00GiB    8.00MiB   1.00MiB      8.99GiB
  -- ----------  ---------  ---------  --------  -----------
     Total         1.00GiB    8.00MiB   1.00MiB      8.99GiB
     Used        200.25MiB  352.00KiB  16.00KiB

  # btrfs dev us .
  /dev/sda10, ID: 1
     Device size:            10.00GiB
     Device slack:              0.00B
     Data,RAID0/1:            1.00GiB
     Metadata,single:         8.00MiB
     System,single:           1.00MiB
     Unallocated:             8.99GiB

Note "Data,RAID0/1": with btrfs-progs 5.13+, the number of devices per
profile is printed.

Signed-off-by: David Sterba <dsterba@suse.com>
---
 fs/btrfs/volumes.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)