Message ID: cover.1652428644.git.wqu@suse.com (mailing list archive)
Series: btrfs: cleanups and preparation for the incoming RAID56J features
On Fri, May 13, 2022 at 04:34:27PM +0800, Qu Wenruo wrote:
> Since I'm going to introduce two new chunk profiles, RAID5J and RAID6J
> (J for journal),

Do you need to introduce completely new profiles? IIRC in my drafts I
was more inclined to reuse a dedicated raid1c3 block group, of an
arbitrary size and not used by anything else. A new profile would
technically work too, but it brings other issues.

As a related feature to the raid56, I was working on the striped
raid10c34 profiles but was not able to make them work. In a sense this
is easier as it reuses existing code, but if you make the journal work
we won't need that.

> if we're relying on ad-hoc if () else if () branches to calculate
> things like the number of p/q stripes, it will cause a lot of
> problems.
>
> In fact, during my development, I have hit tons of bugs related to
> this.
>
> One example is alloc_rbio(), which assigns rbio->nr_data: if we forget
> to update the check for the RAID5 and RAID6 profiles, we get a bad
> nr_data == num_stripes and screw up later writeback.
>
> 90% of my suffering comes from such ad-hoc usage doing manual checks
> on RAID56.
>
> Another example is scrub_stripe(): due to the extra per-device
> reservation, @dev_extent_len is no longer the same as the data stripe
> length calculated from the extent_map.
>
> So this patchset will do the following cleanups, preparing for the
> incoming RAID56J (already finished coding; functionality and on-disk
> format are fine, although there is no journal yet):
>
> - Calculate device stripe length in-house inside scrub_stripe()
>   This removes one of the nasty mismatches, the one which is less
>   obvious.
>
> - Use btrfs_raid_array[] based calculation instead of ad-hoc checks
>   The only exception is scrub_nr_raid_mirrors(), which has several
>   differences from btrfs_num_copies():
>
>   * No iteration on all RAID6 combinations
>     Not sure if it's planned or not.
>
>   * Use bioc->num_stripes directly
>     In that context, bioc already covers all the mirrors for the same
>     stripe, thus there is no need to look them up using
>     btrfs_raid_array[].
>
> With all these cleanups, the RAID56J will be much easier to implement.
>
> Qu Wenruo (4):
>   btrfs: remove @dev_extent_len argument from scrub_stripe() function
>   btrfs: use btrfs_chunk_max_errors() to replace weird tolerance
>     calculation
>   btrfs: use btrfs_raid_array[] to calculate the number of parity
>     stripes
>   btrfs: use btrfs_raid_array[].ncopies in btrfs_num_copies()

The preparatory patches look good to me.
On 2022/5/13 19:38, David Sterba wrote:
> On Fri, May 13, 2022 at 04:34:27PM +0800, Qu Wenruo wrote:
>> Since I'm going to introduce two new chunk profiles, RAID5J and RAID6J
>> (J for journal),
>
> Do you need to introduce completely new profiles? IIRC in my drafts I
> was more inclined to reuse a dedicated raid1c3 block group, of an
> arbitrary size and not used by anything else. A new profile would
> technically work too, but it brings other issues.

The point here is, I want to reserve some space for each data stripe of
a RAID56 chunk.

Unfortunately, we need some on-disk format change anyway, to reuse
btrfs_chunk::io_align to represent how many bytes are reserved for each
dev extent, before where the data really starts.

So new RAID56 types are needed anyway.

Sure, this makes the dev extent length mismatch the extent length
calculated purely from a chunk map. But all the involved changes are
not that huge, and still reviewable.

My latest version, already under fstests (with the new RAID56J profiles
added to the related test cases) but without real journal code yet,
looks like this:

 fs/btrfs/block-group.c          |  23 ++++++-
 fs/btrfs/ctree.h                |   7 +-
 fs/btrfs/scrub.c                |  15 ++--
 fs/btrfs/tree-checker.c         |   4 ++
 fs/btrfs/volumes.c              | 118 +++++++++++++++++++++++++++++---
 fs/btrfs/volumes.h              |   7 +-
 fs/btrfs/zoned.c                |   2 +
 include/uapi/linux/btrfs.h      |   1 +
 include/uapi/linux/btrfs_tree.h |  30 +++++++-
 9 files changed, 185 insertions(+), 22 deletions(-)

Most changes are involved in dev extent handling, and some sites can
not use btrfs_raid_array[] directly.

I guess you'll see that kernel patch, along with the needed progs
patches, next week.

Thanks,
Qu

> As a related feature to the raid56, I was working on the striped
> raid10c34 profiles but was not able to make them work. In a sense this
> is easier as it reuses existing code, but if you make the journal work
> we won't need that.
>
>> if we're relying on ad-hoc if () else if () branches to calculate
>> things like the number of p/q stripes, it will cause a lot of
>> problems.
>>
>> In fact, during my development, I have hit tons of bugs related to
>> this.
>>
>> One example is alloc_rbio(), which assigns rbio->nr_data: if we forget
>> to update the check for the RAID5 and RAID6 profiles, we get a bad
>> nr_data == num_stripes and screw up later writeback.
>>
>> 90% of my suffering comes from such ad-hoc usage doing manual checks
>> on RAID56.
>>
>> Another example is scrub_stripe(): due to the extra per-device
>> reservation, @dev_extent_len is no longer the same as the data stripe
>> length calculated from the extent_map.
>>
>> So this patchset will do the following cleanups, preparing for the
>> incoming RAID56J (already finished coding; functionality and on-disk
>> format are fine, although there is no journal yet):
>>
>> - Calculate device stripe length in-house inside scrub_stripe()
>>   This removes one of the nasty mismatches, the one which is less
>>   obvious.
>>
>> - Use btrfs_raid_array[] based calculation instead of ad-hoc checks
>>   The only exception is scrub_nr_raid_mirrors(), which has several
>>   differences from btrfs_num_copies():
>>
>>   * No iteration on all RAID6 combinations
>>     Not sure if it's planned or not.
>>
>>   * Use bioc->num_stripes directly
>>     In that context, bioc already covers all the mirrors for the same
>>     stripe, thus there is no need to look them up using
>>     btrfs_raid_array[].
>>
>> With all these cleanups, the RAID56J will be much easier to implement.
>>
>> Qu Wenruo (4):
>>   btrfs: remove @dev_extent_len argument from scrub_stripe() function
>>   btrfs: use btrfs_chunk_max_errors() to replace weird tolerance
>>     calculation
>>   btrfs: use btrfs_raid_array[] to calculate the number of parity
>>     stripes
>>   btrfs: use btrfs_raid_array[].ncopies in btrfs_num_copies()
>
> The preparatory patches look good to me.
Hi,

---- From: Qu Wenruo <wqu@suse.com> -- Sent: 2022-05-13 - 10:34 ----

> Since I'm going to introduce two new chunk profiles, RAID5J and RAID6J
> (J for journal),

Great to see work being done on the RAID56 parts of Btrfs. :)

I am just a user of btrfs and don't have a full understanding of the
internals, but it makes me a little curious that we chose to use
journals and RMW instead of a CoW solution to solve the write hole.

Since we need on-disk changes to implement it, could it not be better
to rethink the raid56 modes and implement a solution with full CoW,
such as variable stripe extents etc.? It is likely much more work, but
it could also have better performance, because it avoids double writes
and RMW cycles.

Thanks,
Forza
On 2022/5/13 23:14, Forza wrote:
> Hi,
>
> ---- From: Qu Wenruo <wqu@suse.com> -- Sent: 2022-05-13 - 10:34 ----
>
>> Since I'm going to introduce two new chunk profiles, RAID5J and RAID6J
>> (J for journal),
>
> Great to see work being done on the RAID56 parts of Btrfs. :)
>
> I am just a user of btrfs and don't have a full understanding of the
> internals, but it makes me a little curious that we chose to use
> journals and RMW instead of a CoW solution to solve the write hole.

In fact, Johannes from WDC is already working on a pure CoW based
solution, called the stripe tree.

The idea there is to introduce a new layer of mapping. With that stripe
tree, inside one chunk the logical bytenr is no longer directly mapped
to a physical location, but can be dynamically mapped to any physical
location inside the chunk range.

So previously, if we have a RAID56 chunk on 3 disks:

  Logical bytenr  X         X + 64K        X + 128K
                  |         |              |

then we have the following fixed on-disk mapping:

  [X, X + 64K):         devid 1, physical Y1
  [X + 64K, X + 128K):  devid 2, physical Y2
  Parity for the above: devid 3, physical Y3

So if we just write 64K into logical bytenr X, we need to read out the
data at logical bytenr [X + 64K, X + 128K), then calculate the parity
and write it to devid 3 at physical offset Y3.

But with the new stripe tree, we can map [X, X + 64K) to any location
on devid 1. So can [X + 64K, X + 128K) and the parity.

Then when we write data into logical bytenr [X, X + 64K), we just find
a free 64K range in the stripe tree of devid 1, and check if we have
already mapped [X + 64K, X + 128K) in the stripe tree:

a) Mapped
   If we have [X + 64K, X + 128K) mapped, we read that range out,
   update our parity stripe, write the parity stripe to some new
   location (CoW), then free up the old stripe.

b) Not mapped
   This means we haven't written any data into that range, so it is all
   zero. We calculate the parity with all zero, find a new location for
   the parity on devid 3, write the newly calculated parity, and insert
   a mapping for the new parity location.

By this we in fact decouple the 1:1 mapping of RAID56, and get way more
flexibility. This idea no longer follows the strict rotation of RAID5,
so it's a middle ground between RAID4 and RAID5.

The brilliant idea was introduced mostly to support different chunk
profiles on zoned devices, but Johannes is working on enabling it for
non-zoned devices too.

Then you may ask why I'm still pushing this way more traditional
RAID56J solution. The reasons are:

- Complexity
  The stripe tree is flexible, thus more complex. And AFAIK it will
  affect all chunk types, not only RAID56. Thus it can be more
  challenging.

- It currently relies on the zoned unit to split extents/stripes
  That said, I believe Johannes can solve this without any problems.

- I just want a valid way to steal code from the dm/md guys :)

Don't get me wrong, I totally believe the stripe tree can be the silver
bullet, but that doesn't prevent us from exploring some different (and
more traditional) ways.

> Since we need on-disk changes to implement it, could it not be better
> to rethink the raid56 modes and implement a solution with full CoW,
> such as variable stripe extents etc.? It is likely much more work, but
> it could also have better performance, because it avoids double writes
> and RMW cycles.

Well, the journal will have optimizations too; e.g. a full stripe write
doesn't need to journal its data. I'll learn from (steal code from)
dm/md to implement it.

But there are problems inherent in RAID56 that affect dm/md raid56 too.
One example is bitrot in one data stripe while we're writing data into
the other data stripe: RMW will read out the bad data stripe, calculate
parity from it, and make the bitrot permanent.

Such destructive RMW will not be detected by traditional raid56 on a
traditional fs, but it can be detected by btrfs.

Thus after the RAID56J project, I'll take more time on that destructive
RMW problem.

Thanks,
Qu

> Thanks
>
> Forza
On Fri, May 13, 2022 at 04:34:27PM +0800, Qu Wenruo wrote:
> Since I'm going to introduce two new chunk profiles, RAID5J and RAID6J
> (J for journal), if we're relying on ad-hoc if () else if () branches
> to calculate things like the number of p/q stripes, it will cause a
> lot of problems.
>
> In fact, during my development, I have hit tons of bugs related to
> this.
>
> One example is alloc_rbio(), which assigns rbio->nr_data: if we forget
> to update the check for the RAID5 and RAID6 profiles, we get a bad
> nr_data == num_stripes and screw up later writeback.
>
> 90% of my suffering comes from such ad-hoc usage doing manual checks
> on RAID56.
>
> Another example is scrub_stripe(): due to the extra per-device
> reservation, @dev_extent_len is no longer the same as the data stripe
> length calculated from the extent_map.
>
> So this patchset will do the following cleanups, preparing for the
> incoming RAID56J (already finished coding; functionality and on-disk
> format are fine, although there is no journal yet):
>
> - Calculate device stripe length in-house inside scrub_stripe()
>   This removes one of the nasty mismatches, the one which is less
>   obvious.
>
> - Use btrfs_raid_array[] based calculation instead of ad-hoc checks
>   The only exception is scrub_nr_raid_mirrors(), which has several
>   differences from btrfs_num_copies():
>
>   * No iteration on all RAID6 combinations
>     Not sure if it's planned or not.
>
>   * Use bioc->num_stripes directly
>     In that context, bioc already covers all the mirrors for the same
>     stripe, thus there is no need to look them up using
>     btrfs_raid_array[].
>
> With all these cleanups, the RAID56J will be much easier to implement.
>
> Qu Wenruo (4):
>   btrfs: remove @dev_extent_len argument from scrub_stripe() function
>   btrfs: use btrfs_chunk_max_errors() to replace weird tolerance
>     calculation
>   btrfs: use btrfs_raid_array[] to calculate the number of parity
>     stripes
>   btrfs: use btrfs_raid_array[].ncopies in btrfs_num_copies()

Added to misc-next, thanks.