Message ID: 20210722192955.18709-1-dsterba@suse.com (mailing list archive)
State: New, archived
Series: btrfs: allow degenerate raid0/raid10
On 2021/7/23 3:29 AM, David Sterba wrote:
> The data on raid0 and raid10 are supposed to be spread over multiple
> devices, so the minimum constraints are set to 2 and 4 respectively.
> This is an artificial limit and there's some interest to remove it.

This could be a better way to solve the SINGLE chunks created by a
degraded mount.

> Change this to allow raid0 on one device and raid10 on two devices. This
> works as expected eg. when converting or removing devices.
>
> The only difference is when raid0 on two devices gets one device
> removed. Unpatched would silently create a single profile, while newly
> it would be raid0.
>
> The motivation is to allow to preserve the profile type as long as
> possible for some intermediate state (device removal, conversion).
>
> Unpatched kernel will mount and use the degenerate profiles just fine
> but won't allow any operation that would not satisfy the stricter device
> number constraints, eg. not allowing to go from 3 to 2 devices for
> raid10 or various profile conversions.

My initial thought was that the tree-checker would report errors like
crazy, but no: the check for RAID1 only cares about sub-stripes, while
for RAID0 there is no check on the number of devices.

So a good surprise here.

Another thing about single-device RAID0 and two-device RAID10 is stripe
splitting.

Single-device RAID0 is just SINGLE, while two-device RAID10 is just
RAID1. Thus they need no stripe splitting at all.

But we will still do the stripe calculation, which could slightly reduce
performance. Not a big deal though.

>
> Example output:
>
> # btrfs fi us -T .
> Overall:
>     Device size:                  10.00GiB
>     Device allocated:              1.01GiB
>     Device unallocated:            8.99GiB
>     Device missing:                  0.00B
>     Used:                        200.61MiB
>     Free (estimated):              9.79GiB  (min: 9.79GiB)
>     Free (statfs, df):             9.79GiB
>     Data ratio:                       1.00
>     Metadata ratio:                   1.00
>     Global reserve:                3.25MiB  (used: 0.00B)
>     Multiple profiles:                  no
>
>               Data      Metadata  System
> Id Path       RAID0     single    single   Unallocated
> -- ---------- --------- --------- -------- -----------
>  1 /dev/sda10   1.00GiB   8.00MiB  1.00MiB     8.99GiB
> -- ---------- --------- --------- -------- -----------
>    Total        1.00GiB   8.00MiB  1.00MiB     8.99GiB
>    Used       200.25MiB 352.00KiB 16.00KiB
>
> # btrfs dev us .
> /dev/sda10, ID: 1
>    Device size:            10.00GiB
>    Device slack:              0.00B
>    Data,RAID0/1:            1.00GiB

Can we slightly enhance the output?
RAID0/1 really looks like a new profile now, even though the "1" really
means the number of devices.

>    Metadata,single:         8.00MiB
>    System,single:           1.00MiB
>    Unallocated:             8.99GiB
>
> Note "Data,RAID0/1", with btrfs-progs 5.13+ the number of devices per
> profile is printed.
>
> Signed-off-by: David Sterba <dsterba@suse.com>

Reviewed-by: Qu Wenruo <wqu@suse.com>

Thanks,
Qu

> ---
>  fs/btrfs/volumes.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 86846d6e58d0..ad943357072b 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -38,7 +38,7 @@ const struct btrfs_raid_attr btrfs_raid_array[BTRFS_NR_RAID_TYPES] = {
>          .sub_stripes        = 2,
>          .dev_stripes        = 1,
>          .devs_max           = 0,    /* 0 == as many as possible */
> -        .devs_min           = 4,
> +        .devs_min           = 2,
>          .tolerated_failures = 1,
>          .devs_increment     = 2,
>          .ncopies            = 2,
> @@ -103,7 +103,7 @@ const struct btrfs_raid_attr btrfs_raid_array[BTRFS_NR_RAID_TYPES] = {
>          .sub_stripes        = 1,
>          .dev_stripes        = 1,
>          .devs_max           = 0,
> -        .devs_min           = 2,
> +        .devs_min           = 1,
>          .tolerated_failures = 0,
>          .devs_increment     = 1,
>          .ncopies            = 1,
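The patch only changes two `devs_min` fields, but its effect on which configurations are allowed is easy to model. Below is a minimal stand-alone sketch, not kernel code: the table is reduced to the fields quoted in the diff, with the post-patch values, and `profile_allowed` is an invented helper name for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Reduced stand-in for the kernel's btrfs_raid_attr table, keeping only
 * the fields relevant to the patch. Values are the post-patch minimums. */
struct raid_attr {
	const char *name;
	int devs_min;	/* minimum number of devices to allocate a chunk */
	int devs_max;	/* 0 == as many as possible */
};

static const struct raid_attr raid_table[] = {
	{ "single", 1, 1 },
	{ "raid0",  1, 0 },	/* devs_min was 2 before the patch */
	{ "raid1",  2, 2 },
	{ "raid10", 2, 0 },	/* devs_min was 4 before the patch */
};

/* Can a chunk with this profile be allocated across num_devices devices? */
static int profile_allowed(const char *name, int num_devices)
{
	for (size_t i = 0; i < sizeof(raid_table) / sizeof(raid_table[0]); i++)
		if (strcmp(raid_table[i].name, name) == 0)
			return num_devices >= raid_table[i].devs_min;
	return 0;	/* unknown profile */
}
```

With the old values, `profile_allowed("raid0", 1)` and `profile_allowed("raid10", 2)` would both have been rejected; after the patch they pass, which is exactly the degenerate behavior the commit message describes.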
On Fri, Jul 23, 2021 at 06:51:31AM +0800, Qu Wenruo wrote:
> On 2021/7/23 3:29 AM, David Sterba wrote:
> > The data on raid0 and raid10 are supposed to be spread over multiple
> > devices, so the minimum constraints are set to 2 and 4 respectively.
> > This is an artificial limit and there's some interest to remove it.
>
> This could be a better way to solve the SINGLE chunks created by a
> degraded mount.

Yes, but in this case it's rather a coincidence, because raid0 now
becomes a valid fallback profile; other cases are not affected. There's
also some interest to allow full writes with missing devices (as long as
complete data can be written, not necessarily to all copies). MD-RAID
allows that.

As an example, when we'd allow that, a 2-device raid1 with one device
missing will continue to write to the present device, and once the
missing device reappears, scrub will fill in the missing bits, or device
replace will do a full sync.

> > Change this to allow raid0 on one device and raid10 on two devices. This
> > works as expected eg. when converting or removing devices.
> >
> > [...]
>
> My initial thought was that the tree-checker would report errors like
> crazy, but no: the check for RAID1 only cares about sub-stripes, while
> for RAID0 there is no check on the number of devices.
>
> So a good surprise here.
>
> Another thing about single-device RAID0 and two-device RAID10 is stripe
> splitting.
>
> Single-device RAID0 is just SINGLE, while two-device RAID10 is just
> RAID1. Thus they need no stripe splitting at all.
>
> But we will still do the stripe calculation, which could slightly reduce
> performance. Not a big deal though.

Yeah, effectively raid0 == single and raid10 == raid1. I haven't checked
the overhead of the additional striping logic nor measured the
performance impact, but I don't feel it would be noticeable.

> > Example output:
> >
> > # btrfs fi us -T .
> > [...]
> >
> > # btrfs dev us .
> > /dev/sda10, ID: 1
> >    Device size:            10.00GiB
> >    Device slack:              0.00B
> >    Data,RAID0/1:            1.00GiB
>
> Can we slightly enhance the output?
> RAID0/1 really looks like a new profile now, even though the "1" really
> means the number of devices.

Do you have a concrete suggestion? This format was inspired by a
discussion and suggested by users, so I guess this is what people expect,
and I find it clear. It's also documented in the manual page, so if you
think it's not clear or missing some important information, please let me
know.
On Fri, 23 Jul 2021 16:08:43 +0200, David Sterba <dsterba@suse.cz> wrote:
> > Can we slightly enhance the output?
> > RAID0/1 really looks like a new profile now, even though the "1" really
> > means the number of devices.
>
> Do you have a concrete suggestion? This format was inspired by a
> discussion and suggested by users, so I guess this is what people expect,
> and I find it clear. It's also documented in the manual page, so if you
> think it's not clear or missing some important information, please let
> me know.

It really reads like another RAID level, easily confused with RAID10.

Or that it would flip between RAID0 and RAID1 depending on something.

Maybe something like RAID0d1?
On Fri, Jul 23, 2021 at 10:27:30PM +0500, Roman Mamedov wrote:
> On Fri, 23 Jul 2021 16:08:43 +0200, David Sterba <dsterba@suse.cz> wrote:
> > [...]
>
> It really reads like another RAID level, easily confused with RAID10.
>
> Or that it would flip between RAID0 and RAID1 depending on something.

I think it could be confusing when the number of stripes is also another
raid level, like /1 in this case. From the commit

https://github.com/kdave/btrfs-progs/commit/4693e8226140289dcf8f0932af05895a38152817

  /dev/vdc, ID: 1
     Device size:             1.00GiB
     Device slack:              0.00B
     Data,RAID0/2:          912.62MiB
     Data,RAID0/3:          912.62MiB
     Metadata,RAID1:        102.38MiB
     System,RAID1:            8.00MiB
     Unallocated:             1.00MiB

it's IMHO clear, or at least prompting to read the docs for what it
means.

> Maybe something like RAID0d1?

That looks similar to RAID1c3, which I'd interpret as a new profile as
well. The raid56 profiles also print the stripe count, so I don't know
if eg. RAID5d4 is really an improvement. A 4-device mix of raid56 data
and metadata would look like:

# btrfs dev us .
/dev/sda10, ID: 1
   Device size:            10.00GiB
   Device slack:              0.00B
   Data,RAID5/4:            1.00GiB
   Metadata,RAID6/4:       64.00MiB
   System,RAID6/4:          8.00MiB
   Unallocated:             8.93GiB

/dev/sda11, ID: 2
   Device size:            10.00GiB
   Device slack:              0.00B
   Data,RAID5/4:            1.00GiB
   Metadata,RAID6/4:       64.00MiB
   System,RAID6/4:          8.00MiB
   Unallocated:             8.93GiB

/dev/sda12, ID: 3
   Device size:            10.00GiB
   Device slack:              0.00B
   Data,RAID5/4:            1.00GiB
   Metadata,RAID6/4:       64.00MiB
   System,RAID6/4:          8.00MiB
   Unallocated:             8.93GiB

/dev/sda13, ID: 4
   Device size:            10.00GiB
   Device slack:              0.00B
   Data,RAID5/4:            1.00GiB
   Metadata,RAID6/4:       64.00MiB
   System,RAID6/4:          8.00MiB
   Unallocated:             8.93GiB

Maybe it's still too new, so nobody is used to it, and we've always had
problems with the raid naming scheme anyway.
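The `PROFILE/ndevs` notation being debated is purely presentational. A tiny sketch shows where the `/4` suffix comes from; the helper name and signature are invented here for illustration and are not the actual btrfs-progs code.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper mirroring the "<type>,<profile>/<devices>" keys in
 * the `btrfs dev us` output above. Profiles reported without a device
 * count (e.g. "Metadata,single") simply omit the suffix. */
static void format_usage_key(char *buf, size_t len, const char *type,
			     const char *profile, int num_devices)
{
	if (num_devices > 0)
		snprintf(buf, len, "%s,%s/%d", type, profile, num_devices);
	else
		snprintf(buf, len, "%s,%s", type, profile);
}
```

So "Data,RAID5/4" is just the data chunk's profile name plus the count of devices that chunk spans, which is exactly why it can be misread as a single fused profile name like RAID10.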
On 2021/7/23 10:08 PM, David Sterba wrote:
> On Fri, Jul 23, 2021 at 06:51:31AM +0800, Qu Wenruo wrote:
>> [...]
>
> Yes, but in this case it's rather a coincidence, because raid0 now
> becomes a valid fallback profile; other cases are not affected. There's
> also some interest to allow full writes with missing devices (as long as
> complete data can be written, not necessarily to all copies). MD-RAID
> allows that.
>
> As an example, when we'd allow that, a 2-device raid1 with one device
> missing will continue to write to the present device, and once the
> missing device reappears, scrub will fill in the missing bits, or device
> replace will do a full sync.
>
>> [...]
>
> Yeah, effectively raid0 == single and raid10 == raid1. I haven't checked
> the overhead of the additional striping logic nor measured the
> performance impact, but I don't feel it would be noticeable.
>
>>> Example output:
>>>
>>> # btrfs fi us -T .
>>> [...]
>>>
>>> # btrfs dev us .
>>> /dev/sda10, ID: 1
>>>    Device size:            10.00GiB
>>>    Device slack:              0.00B
>>>    Data,RAID0/1:            1.00GiB
>>
>> Can we slightly enhance the output?
>> RAID0/1 really looks like a new profile now, even though the "1" really
>> means the number of devices.
>
> Do you have a concrete suggestion? This format was inspired by a
> discussion and suggested by users, so I guess this is what people expect,
> and I find it clear. It's also documented in the manual page, so if you
> think it's not clear or missing some important information, please let
> me know.

My idea may not be pretty though:

    Data,RAID0 (1 dev):      1.00GiB

And if we follow the existing pattern, it can be more confusing, like:

    Data,RAID5/6

Thanks,
Qu
David Sterba wrote:
> Maybe it's still too new, so nobody is used to it, and we've always had
> problems with the raid naming scheme anyway.

Perhaps slightly off topic, but I constantly see that people do not
understand how the BTRFS "RAID" implementation works. They tend to
confuse it with regular RAID and get angry because they run into
"issues" simply because they do not understand the differences.

I have been an enthusiastic BTRFS user for years, and I actually caught
myself incorrectly explaining how regular RAID works to a guy a while
ago. This happened simply because my mind was so used to how BTRFS uses
this terminology that I did not think about it.

As BTRFS is getting used more and more, it may be increasingly difficult
(if not impossible) to get rid of the "RAID" terminology, but in my
opinion it is increasingly important as well.

Some years ago (2018) there was some talk about a new naming scheme:
https://marc.info/?l=linux-btrfs&m=136286324417767

While technically spot on, I found Hugo's naming scheme difficult. It
was based on this idea: numCOPIESnumSTRIPESnumPARITY

1CmS1P = RAID5, or 1 copy, max stripes, 1 parity.

I also do not agree with the use of 'copy'. The Oxford dictionary
defines 'copy' as "a thing that is made to be the same as something
else, especially a document or a work of art".

And while some might argue that copying something to disk from memory
makes it a copy, it ceases to be a copy once the memory contents are
erased. I therefore think that 'replicas' is a far better terminology.

I earlier suggested Rnum.Snum.Pnum as a naming scheme, which I think is
far more readable, so if I may dare to be as bold....

SINGLE  = R0.S0.P0 (no replicas, no stripes (any device), no parity)
DUP     = R1.S1.P0 (1 replica, 1 stripe (one device), no parity)
RAID0   = R0.Sm.P0 (no replicas, max stripes, no parity)
RAID1   = R1.S0.P0 (1 replica, no stripes (any device), no parity)
RAID1c2 = R2.S0.P0 (2 replicas, no stripes (any device), no parity)
RAID1c3 = R3.S0.P0 (3 replicas, no stripes (any device), no parity)
RAID10  = R1.Sm.P0 (1 replica, max stripes, no parity)
RAID5   = R0.Sm.P1 (no replicas, max stripes, 1 parity)
RAID6   = R0.Sm.P2 (no replicas, max stripes, 2 parity)

This (or Hugo's) type of naming scheme would also easily allow adding
more exotic configurations, such as S5, e.g. striping over 5 devices in
a 10-device storage system, which could increase throughput for certain
workloads (because it leaves half the storage devices "free" for other
jobs). A variant of RAID5 behaving like RAID10 would simply be R1.Sm.P1.
Easy peasy... And just for the record, the old RAID terminology should
of course work for compatibility reasons, but ideally should not be
advertised at all.

Sorry for completely derailing the topic, but I felt it was important to
bring up (and I admit to be overenthusiastic about it). :)
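One nice property of the Rnum.Snum.Pnum proposal above is that the mapping from a (replicas, stripes, parity) triple to a name is purely mechanical. A hypothetical encoder sketch, with the struct and function invented here only to illustrate that point (using -1 as a stand-in for the "m" max-stripes placeholder):

```c
#include <stdio.h>
#include <stddef.h>

/* Geometry triple for the proposed naming scheme. */
struct profile_geom {
	int replicas;
	int stripes;	/* -1 == "m", stripe across as many devices as possible */
	int parity;
};

/* Render the triple as "R<n>.S<n|m>.P<n>". */
static void encode_profile(char *buf, size_t len, struct profile_geom g)
{
	if (g.stripes < 0)
		snprintf(buf, len, "R%d.Sm.P%d", g.replicas, g.parity);
	else
		snprintf(buf, len, "R%d.S%d.P%d", g.replicas, g.stripes,
			 g.parity);
}
```

For example, `{0, -1, 1}` encodes to "R0.Sm.P1" (RAID5 in the table above) and `{1, 0, 0}` to "R1.S0.P0" (RAID1), so a tool could derive the name directly from chunk geometry rather than from a fixed list of profile names.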
On Sat, Jul 24, 2021 at 01:04:19PM +0200, waxhead wrote:
> David Sterba wrote:
> > Maybe it's still too new, so nobody is used to it, and we've always had
> > problems with the raid naming scheme anyway.
>
> [...]
>
> I earlier suggested Rnum.Snum.Pnum as a naming scheme, which I think is
> far more readable, so if I may dare to be as bold....
>
> SINGLE  = R0.S0.P0 (no replicas, no stripes (any device), no parity)
> DUP     = R1.S1.P0 (1 replica, 1 stripe (one device), no parity)
> RAID0   = R0.Sm.P0 (no replicas, max stripes, no parity)
> RAID1   = R1.S0.P0 (1 replica, no stripes (any device), no parity)
> RAID1c2 = R2.S0.P0 (2 replicas, no stripes (any device), no parity)
> RAID1c3 = R3.S0.P0 (3 replicas, no stripes (any device), no parity)
> RAID10  = R1.Sm.P0 (1 replica, max stripes, no parity)
> RAID5   = R0.Sm.P1 (no replicas, max stripes, 1 parity)
> RAID6   = R0.Sm.P2 (no replicas, max stripes, 2 parity)
>
> This (or Hugo's) type of naming scheme would also easily allow adding
> more exotic configurations, such as S5, e.g. striping over 5 devices in
> a 10-device storage system [...]

I'd go along with that scheme, with one minor modification -- make the
leading letters lower-case. The choice of lower-case letters in my
original scheme was deliberate, as it breaks up the sequence and makes
it much easier to pick out the most important parts (the numbers) from
the mere positional markers (the letters).

Also, the "M" (caps, because it's equivalent to the large numbers) in
stripes wasn't for "max", but simply the conventional mathematical "m"
-- some number acting as a limit to a counter (as in, "we have n copies
with m stripes and p parity stripes").

Hugo.
Hugo Mills wrote:
> On Sat, Jul 24, 2021 at 01:04:19PM +0200, waxhead wrote:
>> [...]
>
> I'd go along with that scheme, with one minor modification -- make
> the leading letters lower-case. The choice of lower-case letters in my
> original scheme was deliberate, as it breaks up the sequence and makes
> it much easier to pick out the most important parts (the numbers) from
> the mere positional markers (the letters).
>
> Also, the "M" (caps, because it's equivalent to the large numbers)
> in stripes wasn't for "max", but simply the conventional mathematical
> "m" -- some number acting as a limit to a counter (as in, "we have n
> copies with m stripes and p parity stripes").
>
> Hugo.

Agree. Lowercase r0.s0.p0 / r1.sM.p2 is more readable indeed.

I insist on the dots as separators, as this would make possible future
fantasy things, such as min-max ranges, e.g. r2-4.sM.p0, more readable.

(In my fantasy world, r2-6 would mean 6 replicas, of which all but 2 can
automatically be deleted if the filesystem runs low on space. That would
make parallel reads potentially super fast as long as there is plenty of
free space on the filesystem, plus increase redundancy. Free space is
wasted space, just like with memory, so it might as well be used for
something useful.)
On Sat, Jul 24, 2021 at 01:49:30PM +0200, waxhead wrote:
> Hugo Mills wrote:
> > [...]
>
> Agree. Lowercase r0.s0.p0 / r1.sM.p2 is more readable indeed.
> I insist on the dots as separators, as this would make possible future
> fantasy things, such as min-max ranges, e.g. r2-4.sM.p0, more readable.
>
> (In my fantasy world, r2-6 would mean 6 replicas, of which all but 2
> can automatically be deleted if the filesystem runs low on space. [...])

I like the dots.
Ranges, I'm thinking, would be of particular use with stripes -- there
have been discussions in the past about limiting the stripe width on
large numbers of devices, so that you don't end up with a RAID-6 run
across all 24 devices of an array for every stripe. That might be a
use-case for, for example, r1.s3-8.p2.

Hugo.
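The ranged form stays machine-friendly too. A hypothetical parser sketch for specs like "r1.s3-8.p2" (invented here purely for illustration; the "sM"/"sm" max-stripes spelling and replica ranges are deliberately not handled):

```c
#include <stdio.h>

/* Parsed geometry with an optional min-max stripe range. */
struct geom_range {
	int replicas;
	int stripes_min;
	int stripes_max;
	int parity;
};

/* Accepts "r<n>.s<min>-<max>.p<n>" or "r<n>.s<n>.p<n>".
 * Returns 0 on success, -1 if the spec doesn't match either form. */
static int parse_profile(const char *spec, struct geom_range *out)
{
	/* Try the ranged form first; sscanf fails at the '-' otherwise. */
	if (sscanf(spec, "r%d.s%d-%d.p%d", &out->replicas, &out->stripes_min,
		   &out->stripes_max, &out->parity) == 4)
		return 0;
	if (sscanf(spec, "r%d.s%d.p%d", &out->replicas, &out->stripes_min,
		   &out->parity) == 3) {
		out->stripes_max = out->stripes_min;
		return 0;
	}
	return -1;
}
```

So "r1.s3-8.p2" parses to one replica, 3 to 8 stripes, and 2 parity stripes, while a plain "r0.s0.p0" collapses the range to a single value.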
Hugo Mills wrote:
> On Sat, Jul 24, 2021 at 01:49:30PM +0200, waxhead wrote:
>> [...]
>
> I like the dots.
>
> Ranges, I'm thinking, would be of particular use with stripes --
> there have been discussions in the past about limiting the stripe
> width on large numbers of devices, so that you don't end up with a
> RAID-6 run across all 24 devices of an array for every stripe. That
> might be a use-case for, for example, r1.s3-8.p2.
>
> Hugo.

Not to mention that stripe ranges, for example, could potentially let
parts of the filesystem survive what could otherwise be a fatal failure.
And if my wet dream of storage profiles per subvolume ever becomes
possible, the possibilities are endless....
On Sat, Jul 24, 2021 at 01:04:19PM +0200, waxhead wrote: > David Sterba wrote: > > > Maybe it's still too new so nobody is used to it and we've always had > > problems with the raid naming scheme anyway. > > Perhaps slightly off topic , but I see constantly that people do not > understand how BTRFS "RAID" implementation works. They tend to confuse it > with regular RAID and get angry because they run into "issues" simply > because they do not understand the differences. > > I have been an enthusiastic BTRFS user for years, and I actually caught > myself incorrectly explaining how regular RAID works to a guy a while ago. > This happened simply because my mind was so used to how BTRFS uses this > terminology that I did not think about it. > > As BTRFS is getting used more and more it may be increasingly difficult (if > not impossible) to get rid of the "RAID" terminology, but in my opinion also > increasingly more important as well. > > Some years ago (2018) there was some talk about a new naming scheme > https://marc.info/?l=linux-btrfs&m=136286324417767 > > While technically spot on I found Hugo's naming scheme difficult. It was > based on this idea: > numCOPIESnumSTRIPESnumPARITY > > 1CmS1P = Raid5 or 1 copy, max stripes, 1 parity. > > I also do not agree with the use of 'copy'. The Oxford dictionary defines > 'copy' as "a thing that is made to be the same as something else, especially > a document or a work of art" > > And while some might argue that copying something on disk from memory makes > it a copy, it ceases to be a copy once the memory contents is erased. I > therefore think that replicas is a far better terminology. > > I earlier suggested Rnum.Snum.Pnum as a naming scheme which I think is far > more readable so if I may dare to be as bold.... 
>
> SINGLE  = R0.S0.P0 (no replicas, no stripes (any device), no parity)
> DUP     = R1.S1.P0 (1 replica, 1 stripe (one device), no parity)
> RAID0   = R0.Sm.P0 (no replicas, max stripes, no parity)
> RAID1   = R1.S0.P0 (1 replica, no stripes (any device), no parity)
> RAID1c2 = R2.S0.P0 (2 replicas, no stripes (any device), no parity)
> RAID1c3 = R3.S0.P0 (3 replicas, no stripes (any device), no parity)
> RAID10  = R1.Sm.P0 (1 replica, max stripes, no parity)
> RAID5   = R0.Sm.P1 (no replicas, max stripes, 1 parity)
> RAID6   = R0.Sm.P2 (no replicas, max stripes, 2 parity)

   Sorry, I missed a detail here that someone pointed out on IRC.

   "r0" makes no sense to me, as that says there's no data. I would
argue strongly to add one to all of your r values. (Note that "RAID1c2"
isn't one of the current btrfs RAID levels; by extension from the
others, it's equivalent to the current RAID1, and we have RAID1c4,
which is four complete instances of any item of data.)

   My proposal counted *instances* of the data, not the redundancy.

   Hugo.
On 2021-07-24 14:30, Hugo Mills wrote: > On Sat, Jul 24, 2021 at 01:04:19PM +0200, waxhead wrote: >> David Sterba wrote: >> >>> Maybe it's still too new so nobody is used to it and we've always had >>> problems with the raid naming scheme anyway. >> >> Perhaps slightly off topic , but I see constantly that people do not >> understand how BTRFS "RAID" implementation works. They tend to confuse it >> with regular RAID and get angry because they run into "issues" simply >> because they do not understand the differences. >> >> I have been an enthusiastic BTRFS user for years, and I actually caught >> myself incorrectly explaining how regular RAID works to a guy a while ago. >> This happened simply because my mind was so used to how BTRFS uses this >> terminology that I did not think about it. >> >> As BTRFS is getting used more and more it may be increasingly difficult (if >> not impossible) to get rid of the "RAID" terminology, but in my opinion also >> increasingly more important as well. >> >> Some years ago (2018) there was some talk about a new naming scheme >> https://marc.info/?l=linux-btrfs&m=136286324417767 >> >> While technically spot on I found Hugo's naming scheme difficult. It was >> based on this idea: >> numCOPIESnumSTRIPESnumPARITY >> >> 1CmS1P = Raid5 or 1 copy, max stripes, 1 parity. >> >> I also do not agree with the use of 'copy'. The Oxford dictionary defines >> 'copy' as "a thing that is made to be the same as something else, especially >> a document or a work of art" >> >> And while some might argue that copying something on disk from memory makes >> it a copy, it ceases to be a copy once the memory contents is erased. I >> therefore think that replicas is a far better terminology. >> >> I earlier suggested Rnum.Snum.Pnum as a naming scheme which I think is far >> more readable so if I may dare to be as bold.... 
>>
>> SINGLE  = R0.S0.P0 (no replicas, no stripes (any device), no parity)
>> DUP     = R1.S1.P0 (1 replica, 1 stripe (one device), no parity)
>> RAID0   = R0.Sm.P0 (no replicas, max stripes, no parity)
>> RAID1   = R1.S0.P0 (1 replica, no stripes (any device), no parity)
>> RAID1c2 = R2.S0.P0 (2 replicas, no stripes (any device), no parity)
>> RAID1c3 = R3.S0.P0 (3 replicas, no stripes (any device), no parity)
>> RAID10  = R1.Sm.P0 (1 replica, max stripes, no parity)
>> RAID5   = R0.Sm.P1 (no replicas, max stripes, 1 parity)
>> RAID6   = R0.Sm.P2 (no replicas, max stripes, 2 parity)
>
> Sorry, I missed a detail here that someone pointed out on IRC.
>
> "r0" makes no sense to me, as that says there's no data. I would
> argue strongly to add one to all of your r values. (Note that
> "RAID1c2" isn't one of the current btrfs RAID levels, and by extension
> from the others, it's equivalent to the current RAID1, and we have
> RAID1c4 which is four complete instances of any item of data).
>
> My proposal counted *instances* of the data, not the redundancy.
>
> Hugo.

I think Hugo is right that the terminology of "instance"[1] is easier
to understand than copies or replicas.

Example:
"single"  would be 1 instance
"dup"     would be 2 instances
"raid1"   would be 2 instances, 1 stripe, 0 parity
"raid1c3" would be 3 instances, 1 stripe, 0 parity
"raid1c4" would be 4 instances, 1 stripe, 0 parity
... and so on.

Shortened, we could then use i<num>.s<num>.p<num> for Instances,
Stripes and Parities.

Do we need a specific term for the level of "redundancy"? In the
current suggestions we can have redundancy either because of parity or
because of multiple instances.
Perhaps the output of btrfs-progs could mention the redundancy level,
such as this:

# btrfs fi us /mnt/btrfs/ -T
Overall:
    Device size:                  18.18TiB
    Device allocated:             11.24TiB
    Device unallocated:            6.93TiB
    Device missing:                  0.00B
    Used:                         11.21TiB
    Free (estimated):              6.97TiB  (min: 3.50TiB)
    Free (statfs, df):             6.97TiB
    Data ratio:                       1.00
    Metadata ratio:                   2.00
    Global reserve:              512.00MiB  (used: 0.00B)
    Multiple profiles:                  no

              Data      Metadata  System
Mode:         i1,s0,p0  i2,s0,p0  i2,s0,p0
Redundancy:   0         1         1
------------  --------  --------  --------  -----------
Id Path                                     Unallocated
-- ---------  --------  --------  --------  -----------
 3 /dev/sdb2   5.61TiB  17.00GiB  32.00MiB      3.47TiB
 4 /dev/sdd2   5.60TiB  17.00GiB  32.00MiB      3.47TiB
-- ---------  --------  --------  --------  -----------
   Total      11.21TiB  17.00GiB  32.00MiB      6.94TiB
   Used       11.18TiB  15.76GiB   1.31MiB

Thanks
~Forza

[1] https://www.techopedia.com/definition/16325/instance
On Sat, Jul 24, 2021 at 03:54:18PM +0200, Forza wrote: > > > On 2021-07-24 14:30, Hugo Mills wrote: > > On Sat, Jul 24, 2021 at 01:04:19PM +0200, waxhead wrote: > > > David Sterba wrote: > > > > > > > Maybe it's still too new so nobody is used to it and we've always had > > > > problems with the raid naming scheme anyway. > > > > > > Perhaps slightly off topic , but I see constantly that people do not > > > understand how BTRFS "RAID" implementation works. They tend to confuse it > > > with regular RAID and get angry because they run into "issues" simply > > > because they do not understand the differences. > > > > > > I have been an enthusiastic BTRFS user for years, and I actually caught > > > myself incorrectly explaining how regular RAID works to a guy a while ago. > > > This happened simply because my mind was so used to how BTRFS uses this > > > terminology that I did not think about it. > > > > > > As BTRFS is getting used more and more it may be increasingly difficult (if > > > not impossible) to get rid of the "RAID" terminology, but in my opinion also > > > increasingly more important as well. > > > > > > Some years ago (2018) there was some talk about a new naming scheme > > > https://marc.info/?l=linux-btrfs&m=136286324417767 > > > > > > While technically spot on I found Hugo's naming scheme difficult. It was > > > based on this idea: > > > numCOPIESnumSTRIPESnumPARITY > > > > > > 1CmS1P = Raid5 or 1 copy, max stripes, 1 parity. > > > > > > I also do not agree with the use of 'copy'. The Oxford dictionary defines > > > 'copy' as "a thing that is made to be the same as something else, especially > > > a document or a work of art" > > > And while some might argue that copying something on disk from memory makes > > > it a copy, it ceases to be a copy once the memory contents is erased. That last sentence doesn't make sense. A copy does not cease to be a copy because the original (or some upstream copy) was destroyed. 
I think the usage of "copy" is fine, and also better than the
alternatives proposed so far. "copy" is shorter than "mirror",
"replica", or "instance", and "copy" or its abbreviation "c" already
appears in various btrfs documents, code, and symbolic names. Most
users on IRC understand "raid1 stores 2 copies of the data" as an
explanation, more successfully than "2 mirrors" or "2 replicas", which
usually require clarification. As far as I can tell, "instances" has
never been used to describe btrfs profile-level data replication
before today.

"Instance" does have the advantage of being less ambiguous. The other
words imply the existence of one additional extra thing, an "original"
that has been copied, mirrored, or replicated, which causes off-by-one
errors when people try to understand whether "two mirrors" without
context means 2 or 3 identical instances.

The disadvantage of "instance" is that it abbreviates to "i", which
causes problems when people write "I1" in a sans-serif font. We could
use "n" instead, though people like to write "array of N disks".
Probably all of the letters are bad in some way. I'll use "n" for now,
as none of the non-instance words use that letter.

> > > I therefore think that replicas is a far better terminology.
> > > I earlier suggested Rnum.Snum.Pnum as a naming scheme which I think is far
> > > more readable so if I may dare to be as bold....

Right now the btrfs profile names are all snowflakes: there is a short
string name, and we have to look up in a table what behavior to expect
when that name is invoked. This matches the implementation, which uses
a single bit for all of the above names, and indeed does use a lookup
table to get the parameters (though the code also uses far too many
'if' statements as well).
Here's the list of names adjusted according to comments so far, plus
some corrections (lowercase axis name, 1-based instance count,
capitalize 'M', fix the number of stripes, and use correct btrfs
profile names):

single  = n1.s1.p0 (1 instance, 1 stripe, no parity)
dup     = n2.s1.p0 (2 instances, 1 stripe, no parity)
raid0   = n1.sM.p0 (1 instance, (1..avail_disks) stripes, no parity)
raid1   = n2.s1.p0 (2 instances, 1 stripe, no parity)
raid1c3 = n3.s1.p0 (3 instances, 1 stripe, no parity)
raid1c4 = n4.s1.p0 (4 instances, 1 stripe, no parity)
raid10  = n2.sM.p0 (2 instances, 2..floor(avail_disks / num_instances) stripes, no parity)
raid5   = n1.sM.p1 (1 instance, 1..(avail_disks - parity) stripes, 1 parity)
raid6   = n1.sM.p2 (1 instance, 1..(avail_disks - parity) stripes, 2 parity)

There are three problems:

1. Orthogonal naming conventions make it look like we should be able
to derive parameters by analyzing the names (e.g. RAID5 on RAID1
mirror pairs is "n2.sM.p1", RAID1 of two RAID5 groups is...also
"n2.sM.p1"...OK, there are 3 problems), but btrfs implements exactly
the 9 profiles above and nothing else. We can only use the subset of
orthogonal names that exist in the implementation.

2. The orthogonal-looking dimensions are not fully independent. Is
"n2.sM.p1" RAID5 of RAID1 mirror pairs, or RAID1 of 2x RAID5 parity
groups? Arguably that's not so bad since neither exists yet--we could
say it must be the first one, since that's closer to what RAID10 does,
so it would basically be RAID10 with an extra (pair of) parity devices.

3. Future profiles and even some of the existing profiles do not fit
in the orthogonal dimensions of this naming convention. All the
existing btrfs profiles except dup can be derived from a formula with
these 3 parameters. 2 copies implies min_devs == 2, so "n2.s1.p0" is
raid1, not dup. There's no way to specify dup with the 3 parameters
given.
That might not be a problem--we could say "nope, 'dup' does not fit
the pattern, so we keep calling that profile 'dup'. It's a special
non-orthogonal case, and users must read the manual to see exactly
what it does." Note that all forms of overloading the "s" parameter
are equivalent to an exceptional name for dup or a new orthogonal
parameter, because there's no integer we can plug into RAID layout
formulas to get the btrfs dup profile behavior.

RAID-Z from ZFS or the proposed extent-tree-v2 stripe tree would be
named "n1.sM.p1", the same name we have for the existing btrfs raid5
profile, despite having completely different parity mechanics. We'd
need another dimension in the profile name to describe how the parity
blocks are managed. We can't write something like "pZ" or "pM" because
we still need the number of parity blocks. If we write something like
"pz2" or "pm1" (for raid-z-style 2-parity or stripe-tree 1-parity
respectively) we are still sending the user to the manual to
understand what the extra letter means, and we're not much further
ahead.

The window for defining these names seems to have closed about 13
years ago. Even if we agreed the orthogonal names are better, we'd
have to make "n2.s1.p0" a non-canonical alias of the canonical "raid1"
name so we don't instantly break over a decade of existing tooling,
documentation, and user training. We'd still have to explain over and
over to new users that "raid1" is "n2.s1.p0" in btrfs, not "nM.s1.p0"
as it is in mdadm raid1.

TL;DR I think this is somewhere between "not a good idea" and "a bad
idea" for naming storage profiles; however, there are some places
where we can still use parts of the proposal.
If the proposal includes _implementing_ an orthogonally parameterized
umbrella raid profile, then the above objection doesn't apply--but to
fit into btrfs with its single-bit-per-profile on-disk schema, all of
the orthogonally named raid profiles would be parameters for one new
btrfs "parameterized raid" profile. We'd still need to have all the
old profiles with their old names to be compatible with old on-disk
filesystems, even if they were internally implemented by translating
to the parameterized implementation.

What I do like about this proposal is that it is useful as a
_reporting_ format, and the reporting tools need help with that. Right
now we have "RAID10" or "RAID10/6" for "RAID10 over 6 disks". We could
have tools report "RAID/n2.s3.p0" instead, assuming we fixed all the
quirky parts of btrfs where the specific profile bit matters. If we
don't fix the quirks, then we still need the full btrfs profile bit
name reported, because we need to be able to distinguish between
"single/n1.s1.p0" and "RAID0/n1.s1.p0" in a few corner cases.

"dup" would still be "dup", though it might have a suffix for the
number of distinct devices that are used for the instances (e.g.
"dup/d2" for instances on 2 devices, vs "dup/d1" for both instances on
the same device). The suffix doesn't have to fit into the orthogonal
naming convention because the profile can select the parameter space.

If we implement a new btrfs raid profile that doesn't fit the pattern,
we can use the new profile's name as a prefix, e.g. on 5 disks,
RAID-Z1 "raidz/n1.s4.p1" is distinguishable from btrfs RAID5
"raid5/n1.s4.p1", so it's clear how many copies, parity, stripes, and
disks there are, and which parity mechanism is used.

> > Sorry, I missed a detail here that someone pointed out on IRC.
> >
> > "r0" makes no sense to me, as that says there's no data. I would
> > argue strongly to add one to all of your r values.
> > (Note that
> > "RAID1c2" isn't one of the current btrfs RAID levels, and by extension
> > from the others, it's equivalent to the current RAID1, and we have
> > RAID1c4 which is four complete instances of any item of data).
> >
> > My proposal counted *instances* of the data, not the redundancy.
> >
> > Hugo.
>
> I think Hugo is right that the terminology of "instance"[1] is easier
> to understand than copies or replicas.
>
> Example:
> "single"  would be 1 instance
> "dup"     would be 2 instances
> "raid1"   would be 2 instances, 1 stripe, 0 parity
> "raid1c3" would be 3 instances, 1 stripe, 0 parity
> "raid1c4" would be 4 instances, 1 stripe, 0 parity
> ... and so on.
>
> Shortened we could then use i<num>.s<num>.p<num> for Instances,
> Stripes and Parities.
>
> Do we need a specific term for level of "redundancy"? In the current
> suggestions we can have redundancy either because of parity or of
> multiple instances. Perhaps the output of btrfs-progs could mention
> redundancy level such as this:
>
> # btrfs fi us /mnt/btrfs/ -T
> Overall:
>     Device size:                  18.18TiB
>     Device allocated:             11.24TiB
>     Device unallocated:            6.93TiB
>     Device missing:                  0.00B
>     Used:                         11.21TiB
>     Free (estimated):              6.97TiB  (min: 3.50TiB)
>     Free (statfs, df):             6.97TiB
>     Data ratio:                       1.00
>     Metadata ratio:                   2.00
>     Global reserve:              512.00MiB  (used: 0.00B)
>     Multiple profiles:                  no
>
>               Data      Metadata  System
> Mode:         i1,s0,p0  i2,s0,p0  i2,s0,p0
> Redundancy:   0         1         1
> ------------  --------  --------  --------  -----------
> Id Path                                     Unallocated
> -- ---------  --------  --------  --------  -----------
>  3 /dev/sdb2   5.61TiB  17.00GiB  32.00MiB      3.47TiB
>  4 /dev/sdd2   5.60TiB  17.00GiB  32.00MiB      3.47TiB
> -- ---------  --------  --------  --------  -----------
>    Total      11.21TiB  17.00GiB  32.00MiB      6.94TiB
>    Used       11.18TiB  15.76GiB   1.31MiB
>
> Thanks
> ~Forza
>
> [1] https://www.techopedia.com/definition/16325/instance
On Fri, Jul 23, 2021 at 06:51:31AM +0800, Qu Wenruo wrote:
> On 2021/7/23 3:29 AM, David Sterba wrote:
> > The data on raid0 and raid10 are supposed to be spread over multiple
> > devices, so the minimum constraints are set to 2 and 4 respectively.
> > This is an artificial limit and there's some interest to remove it.
>
> This could be a better way to solve the SINGLE chunk created by degraded
> mount.
>
> > Change this to allow raid0 on one device and raid10 on two devices. This
> > works as expected eg. when converting or removing devices.
> >
> > The only difference is when raid0 on two devices gets one device
> > removed. Unpatched would silently create a single profile, while newly
> > it would be raid0.
> >
> > The motivation is to allow to preserve the profile type as long as it
> > possible for some intermediate state (device removal, conversion).
> >
> > Unpatched kernel will mount and use the degenerate profiles just fine
> > but won't allow any operation that would not satisfy the stricter device
> > number constraints, eg. not allowing to go from 3 to 2 devices for
> > raid10 or various profile conversions.
>
> My initial thought is, tree-checker will report errors like crazy, but
> no, the check for RAID1 only cares about sub_stripes, while for RAID0
> there is no number-of-devices check.
>
> So a good surprise here.
>
> Another thing about single-device RAID0 or 2-device RAID10 is the
> stripe splitting.
>
> Single-device RAID0 is just SINGLE, while 2-device RAID10 is just RAID1.
> Thus they need no stripe splitting at all.
>
> But we will still do the stripe calculation, thus it could slightly
> reduce the performance.
> Not a big deal though.

It might have a more significant impact on an SSD cache below btrfs.
lvmcache (dm-cache) uses the IO request size to decide whether to
bypass the cache or not. Scrubbing a striped-profile array through
lvmcache is a disaster.
All the IOs are carved up into chunks that are smaller than the cache
bypass threshold, so lvmcache tries to cache all the scrub IO and will
flood the SSD with enough writes to burn out a cheap SSD in a month or
two.

Scrubbing raid1 or single uses bigger IO requests, so lvmcache bypasses
the SSD cache and there's no problem (other than incomplete scrub
coverage because of the caching, but that's a whole separate issue).

The kernel allows this already and it's a reasonably well-known gotcha
(I didn't discover it myself, I heard/read it somewhere else and
confirmed it) so there is no urgency, but some day either lvmcache
should get smarter about merging adjacent short reads, or btrfs should
be smarter about switching to "real" single profile behavior when it
sees a striped profile with 1 stripe.

> > Example output:
> >
> >   # btrfs fi us -T .
> >   Overall:
> >       Device size:                  10.00GiB
> >       Device allocated:              1.01GiB
> >       Device unallocated:            8.99GiB
> >       Device missing:                  0.00B
> >       Used:                        200.61MiB
> >       Free (estimated):              9.79GiB  (min: 9.79GiB)
> >       Free (statfs, df):             9.79GiB
> >       Data ratio:                       1.00
> >       Metadata ratio:                   1.00
> >       Global reserve:                3.25MiB  (used: 0.00B)
> >       Multiple profiles:                  no
> >
> >                  Data       Metadata   System
> >   Id Path        RAID0      single     single    Unallocated
> >   -- ----------  ---------  ---------  --------  -----------
> >    1 /dev/sda10    1.00GiB    8.00MiB   1.00MiB      8.99GiB
> >   -- ----------  ---------  ---------  --------  -----------
> >      Total         1.00GiB    8.00MiB   1.00MiB      8.99GiB
> >      Used        200.25MiB  352.00KiB  16.00KiB
> >
> >   # btrfs dev us .
> >   /dev/sda10, ID: 1
> >      Device size:            10.00GiB
> >      Device slack:              0.00B
> >      Data,RAID0/1:            1.00GiB
>
> Can we slightly enhance the output?
> RAID0/1 really looks like a new profile now, even though the "1" really
> means the number of devices.
>
> >      Metadata,single:         8.00MiB
> >      System,single:           1.00MiB
> >      Unallocated:             8.99GiB
> >
> > Note "Data,RAID0/1", with btrfs-progs 5.13+ the number of devices per
> > profile is printed.
> >
> > Signed-off-by: David Sterba <dsterba@suse.com>
>
> Reviewed-by: Qu Wenruo <wqu@suse.com>
>
> Thanks,
> Qu
>
> > ---
> >  fs/btrfs/volumes.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> > index 86846d6e58d0..ad943357072b 100644
> > --- a/fs/btrfs/volumes.c
> > +++ b/fs/btrfs/volumes.c
> > @@ -38,7 +38,7 @@ const struct btrfs_raid_attr btrfs_raid_array[BTRFS_NR_RAID_TYPES] = {
> >  		.sub_stripes	= 2,
> >  		.dev_stripes	= 1,
> >  		.devs_max	= 0,	/* 0 == as many as possible */
> > -		.devs_min	= 4,
> > +		.devs_min	= 2,
> >  		.tolerated_failures = 1,
> >  		.devs_increment	= 2,
> >  		.ncopies	= 2,
> > @@ -103,7 +103,7 @@ const struct btrfs_raid_attr btrfs_raid_array[BTRFS_NR_RAID_TYPES] = {
> >  		.sub_stripes	= 1,
> >  		.dev_stripes	= 1,
> >  		.devs_max	= 0,
> > -		.devs_min	= 2,
> > +		.devs_min	= 1,
> >  		.tolerated_failures = 0,
> >  		.devs_increment	= 1,
> >  		.ncopies	= 1,
Hugo Mills wrote: > On Sat, Jul 24, 2021 at 01:04:19PM +0200, waxhead wrote: >> David Sterba wrote: >> >>> Maybe it's still too new so nobody is used to it and we've always had >>> problems with the raid naming scheme anyway. >> >> Perhaps slightly off topic , but I see constantly that people do not >> understand how BTRFS "RAID" implementation works. They tend to confuse it >> with regular RAID and get angry because they run into "issues" simply >> because they do not understand the differences. >> >> I have been an enthusiastic BTRFS user for years, and I actually caught >> myself incorrectly explaining how regular RAID works to a guy a while ago. >> This happened simply because my mind was so used to how BTRFS uses this >> terminology that I did not think about it. >> >> As BTRFS is getting used more and more it may be increasingly difficult (if >> not impossible) to get rid of the "RAID" terminology, but in my opinion also >> increasingly more important as well. >> >> Some years ago (2018) there was some talk about a new naming scheme >> https://marc.info/?l=linux-btrfs&m=136286324417767 >> >> While technically spot on I found Hugo's naming scheme difficult. It was >> based on this idea: >> numCOPIESnumSTRIPESnumPARITY >> >> 1CmS1P = Raid5 or 1 copy, max stripes, 1 parity. >> >> I also do not agree with the use of 'copy'. The Oxford dictionary defines >> 'copy' as "a thing that is made to be the same as something else, especially >> a document or a work of art" >> >> And while some might argue that copying something on disk from memory makes >> it a copy, it ceases to be a copy once the memory contents is erased. I >> therefore think that replicas is a far better terminology. >> >> I earlier suggested Rnum.Snum.Pnum as a naming scheme which I think is far >> more readable so if I may dare to be as bold.... 
>>
>> SINGLE  = R0.S0.P0 (no replicas, no stripes (any device), no parity)
>> DUP     = R1.S1.P0 (1 replica, 1 stripe (one device), no parity)
>> RAID0   = R0.Sm.P0 (no replicas, max stripes, no parity)
>> RAID1   = R1.S0.P0 (1 replica, no stripes (any device), no parity)
>> RAID1c2 = R2.S0.P0 (2 replicas, no stripes (any device), no parity)
>> RAID1c3 = R3.S0.P0 (3 replicas, no stripes (any device), no parity)
>> RAID10  = R1.Sm.P0 (1 replica, max stripes, no parity)
>> RAID5   = R0.Sm.P1 (no replicas, max stripes, 1 parity)
>> RAID6   = R0.Sm.P2 (no replicas, max stripes, 2 parity)
>
> Sorry, I missed a detail here that someone pointed out on IRC.
>
> "r0" makes no sense to me, as that says there's no data. I would
> argue strongly to add one to all of your r values. (Note that

I disagree. R0 means no replicas of stored data. All filesystems have
stored data, so adding one seems rather pointless to me. R0 would
easily be interpreted as "you can lose NO device", and Rn would
likewise easily be interpreted as "you can lose n device(s)".

> "RAID1c2" isn't one of the current btrfs RAID levels, and by extension
> from the others, it's equivalent to the current RAID1, and we have
> RAID1c4 which is four complete instances of any item of data).
>
> My proposal counted *instances* of the data, not the redundancy.
>
> Hugo.

You are correct that RAID1c2 does not exist, my mistake -- it should
have been RAID1c3, which is r2.s0.p0, i.e. 3 instances of data.

I am not directly against counting instances of the data, but since
all filesystems have at least one instance to be meaningful, I
personally prefer replicas, as this more clearly indicates how many
devices you can lose (for instances you have to subtract one anyway).
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 86846d6e58d0..ad943357072b 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -38,7 +38,7 @@ const struct btrfs_raid_attr btrfs_raid_array[BTRFS_NR_RAID_TYPES] = {
 		.sub_stripes	= 2,
 		.dev_stripes	= 1,
 		.devs_max	= 0,	/* 0 == as many as possible */
-		.devs_min	= 4,
+		.devs_min	= 2,
 		.tolerated_failures = 1,
 		.devs_increment	= 2,
 		.ncopies	= 2,
@@ -103,7 +103,7 @@ const struct btrfs_raid_attr btrfs_raid_array[BTRFS_NR_RAID_TYPES] = {
 		.sub_stripes	= 1,
 		.dev_stripes	= 1,
 		.devs_max	= 0,
-		.devs_min	= 2,
+		.devs_min	= 1,
 		.tolerated_failures = 0,
 		.devs_increment	= 1,
 		.ncopies	= 1,
The data on raid0 and raid10 are supposed to be spread over multiple
devices, so the minimum constraints are set to 2 and 4 respectively.
This is an artificial limit and there's some interest to remove it.

Change this to allow raid0 on one device and raid10 on two devices.
This works as expected, e.g. when converting or removing devices.

The only difference is when raid0 on two devices gets one device
removed. An unpatched kernel would silently create a single profile,
while newly it would remain raid0.

The motivation is to preserve the profile type for as long as possible
through intermediate states (device removal, conversion).

An unpatched kernel will mount and use the degenerate profiles just
fine, but won't allow any operation that would not satisfy the
stricter device number constraints, e.g. not allowing to go from 3 to
2 devices for raid10, or various profile conversions.

Example output:

  # btrfs fi us -T .
  Overall:
      Device size:                  10.00GiB
      Device allocated:              1.01GiB
      Device unallocated:            8.99GiB
      Device missing:                  0.00B
      Used:                        200.61MiB
      Free (estimated):              9.79GiB  (min: 9.79GiB)
      Free (statfs, df):             9.79GiB
      Data ratio:                       1.00
      Metadata ratio:                   1.00
      Global reserve:                3.25MiB  (used: 0.00B)
      Multiple profiles:                  no

                 Data       Metadata   System
  Id Path        RAID0      single     single    Unallocated
  -- ----------  ---------  ---------  --------  -----------
   1 /dev/sda10    1.00GiB    8.00MiB   1.00MiB      8.99GiB
  -- ----------  ---------  ---------  --------  -----------
     Total         1.00GiB    8.00MiB   1.00MiB      8.99GiB
     Used        200.25MiB  352.00KiB  16.00KiB

  # btrfs dev us .
  /dev/sda10, ID: 1
     Device size:            10.00GiB
     Device slack:              0.00B
     Data,RAID0/1:            1.00GiB
     Metadata,single:         8.00MiB
     System,single:           1.00MiB
     Unallocated:             8.99GiB

Note "Data,RAID0/1": with btrfs-progs 5.13+, the number of devices per
profile is printed.

Signed-off-by: David Sterba <dsterba@suse.com>
---
 fs/btrfs/volumes.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)