Message ID: cover.1652711187.git.johannes.thumshirn@wdc.com (mailing list archive)
Series:     btrfs: introduce raid-stripe-tree
On Mon, May 16, 2022 at 07:31:35AM -0700, Johannes Thumshirn wrote:
> Introduce a raid-stripe-tree to record writes in a RAID environment.
>
> In essence this adds another address translation layer between the logical
> and the physical addresses in btrfs and is designed to close two gaps. The
> first is the ominous RAID-write-hole we suffer from with RAID5/6 and the
> second one is the inability of doing RAID with zoned block devices due to the
> constraints we have with REQ_OP_ZONE_APPEND writes.
>
> This is an RFC/PoC only which just shows how the code will look like for a
> zoned RAID1. Its sole purpose is to facilitate design reviews and is not
> intended to be merged yet. Or if merged, to be used on an actual file-system.
>

This is hard to talk about without seeing the code to add the raid extents and
such. Reading it makes sense, but I don't know how often the stripes are meant
to change. Are they static once they're allocated, like dev extents? I can't
quite fit in my head the relationship with the rest of the allocation system.
Are they coupled with the logical extent that gets allocated? Or are they
coupled with the dev extent? Are they somewhere in between?

Also I realize this is an RFC, but we're going to need some caching for reads so
we're not having to do a tree search on every IO with the RAID stripe tree in
place.

Thanks,

Josef
On 16/05/2022 16:58, Josef Bacik wrote:
> On Mon, May 16, 2022 at 07:31:35AM -0700, Johannes Thumshirn wrote:
>> Introduce a raid-stripe-tree to record writes in a RAID environment.
>>
>> In essence this adds another address translation layer between the logical
>> and the physical addresses in btrfs and is designed to close two gaps. The
>> first is the ominous RAID-write-hole we suffer from with RAID5/6 and the
>> second one is the inability of doing RAID with zoned block devices due to the
>> constraints we have with REQ_OP_ZONE_APPEND writes.
>>
>> This is an RFC/PoC only which just shows how the code will look like for a
>> zoned RAID1. Its sole purpose is to facilitate design reviews and is not
>> intended to be merged yet. Or if merged, to be used on an actual file-system.
>>
>
> This is hard to talk about without seeing the code to add the raid extents and
> such. Reading it makes sense, but I don't know how often the stripes are meant
> to change. Are they static once they're allocated, like dev extents? I can't
> quite fit in my head the relationship with the rest of the allocation system.
> Are they coupled with the logical extent that gets allocated? Or are they
> coupled with the dev extent? Are they somewhere in between?

The stripe extents have a 1:1 relationship with the file-extents, i.e.:

 stripe_extent_key.objectid = btrfs_file_extent_item.disk_bytenr;
 stripe_extent_type         = BTRFS_RAID_STRIPE_EXTENT;
 stripe_extent_offset       = btrfs_file_extent_item.disk_num_bytes;

> Also I realize this is an RFC, but we're going to need some caching for reads so
> we're not having to do a tree search on every IO with the RAID stripe tree in
> place.

Do we really need to do caching of stripe tree entries? They're read once the
corresponding btrfs_file_extent_item is read from disk, which then gets cached
in the page cache. Every overwrite is cached in the page cache as well.

If we're flushing the cache, we need to re-read both the file_extent_item and
the stripe extents.
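[Editorial note: to make the 1:1 mapping above concrete, here is a small self-contained
sketch in userspace C. The struct layouts and the numeric key-type value are simplified
stand-ins made up for illustration; the real definitions live in the RFC patches and in
fs/btrfs/ctree.h / include/uapi/linux/btrfs_tree.h.]

#include <stdio.h>
#include <stdint.h>

/* Simplified stand-ins for illustration only, not the on-disk format. */
struct file_extent_item { uint64_t disk_bytenr; uint64_t disk_num_bytes; };
struct tree_key { uint64_t objectid; uint8_t type; uint64_t offset; };

#define RAID_STRIPE_EXTENT_KEY 230  /* made-up value; the RFC defines the real one */

int main(void)
{
	struct file_extent_item fi = {
		.disk_bytenr = 269484032,	/* example logical start of the extent */
		.disk_num_bytes = 1048576,	/* example extent length */
	};

	/* 1:1 mapping: one stripe extent item per file extent. */
	struct tree_key key = {
		.objectid = fi.disk_bytenr,
		.type = RAID_STRIPE_EXTENT_KEY,
		.offset = fi.disk_num_bytes,
	};

	printf("RST lookup key: (%llu, %u, %llu)\n",
	       (unsigned long long)key.objectid, (unsigned)key.type,
	       (unsigned long long)key.offset);
	return 0;
}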
On Mon, May 16, 2022 at 03:04:35PM +0000, Johannes Thumshirn wrote: > On 16/05/2022 16:58, Josef Bacik wrote: > > On Mon, May 16, 2022 at 07:31:35AM -0700, Johannes Thumshirn wrote: > >> Introduce a raid-stripe-tree to record writes in a RAID environment. > >> > >> In essence this adds another address translation layer between the logical > >> and the physical addresses in btrfs and is designed to close two gaps. The > >> first is the ominous RAID-write-hole we suffer from with RAID5/6 and the > >> second one is the inability of doing RAID with zoned block devices due to the > >> constraints we have with REQ_OP_ZONE_APPEND writes. > >> > >> Thsi is an RFC/PoC only which just shows how the code will look like for a > >> zoned RAID1. Its sole purpose is to facilitate design reviews and is not > >> intended to be merged yet. Or if merged to be used on an actual file-system. > >> > > > > This is hard to talk about without seeing the code to add the raid extents and > > such. Reading it makes sense, but I don't know how often the stripes are meant > > to change. Are they static once they're allocated, like dev extents? I can't > > quite fit in my head the relationship with the rest of the allocation system. > > Are they coupled with the logical extent that gets allocated? Or are they > > coupled with the dev extent? Are they somewhere in between? > > The stripe extents have a 1:1 relationship file the file-extents, i.e: > > stripe_extent_key.objectid = btrfs_file_extent_item.disk_bytenr; > stripe_extent_type = BTRFS_RAID_STRIPE_EXTENT; > stripe_extent_offset = btrfs_file_extent_item.disk_num_bytes; > > > > Also I realize this is an RFC, but we're going to need some caching for reads so > > we're not having to do a tree search on every IO with the RAID stripe tree in > > place. > > Do we really need to do caching of stripe tree entries? They're read, once the > corresponding btrfs_file_extent_item is read from disk, which then gets cached > in the page cache. Every override is cached in the page cache as well. > > If we're flushing the cache, we need to re-read both, the file_extent_item and > the stripe extents. Yup ok if we're 1:1 with the file-extents then we don't want the whole tree striped. Since we're 1:1 with the file-extents please make the stripe tree follow the same convention as the global roots, at least put the load code in the same area as the csum/fst/extent tree, if your stuff gets merged and turned on before extnet tree v2 it'll be easier for me to adapt it. Thanks, Josef
On 16/05/2022 17:10, Josef Bacik wrote: > On Mon, May 16, 2022 at 03:04:35PM +0000, Johannes Thumshirn wrote: >> On 16/05/2022 16:58, Josef Bacik wrote: >>> On Mon, May 16, 2022 at 07:31:35AM -0700, Johannes Thumshirn wrote: >>>> Introduce a raid-stripe-tree to record writes in a RAID environment. >>>> >>>> In essence this adds another address translation layer between the logical >>>> and the physical addresses in btrfs and is designed to close two gaps. The >>>> first is the ominous RAID-write-hole we suffer from with RAID5/6 and the >>>> second one is the inability of doing RAID with zoned block devices due to the >>>> constraints we have with REQ_OP_ZONE_APPEND writes. >>>> >>>> Thsi is an RFC/PoC only which just shows how the code will look like for a >>>> zoned RAID1. Its sole purpose is to facilitate design reviews and is not >>>> intended to be merged yet. Or if merged to be used on an actual file-system. >>>> >>> >>> This is hard to talk about without seeing the code to add the raid extents and >>> such. Reading it makes sense, but I don't know how often the stripes are meant >>> to change. Are they static once they're allocated, like dev extents? I can't >>> quite fit in my head the relationship with the rest of the allocation system. >>> Are they coupled with the logical extent that gets allocated? Or are they >>> coupled with the dev extent? Are they somewhere in between? >> >> The stripe extents have a 1:1 relationship file the file-extents, i.e: >> >> stripe_extent_key.objectid = btrfs_file_extent_item.disk_bytenr; >> stripe_extent_type = BTRFS_RAID_STRIPE_EXTENT; >> stripe_extent_offset = btrfs_file_extent_item.disk_num_bytes; >> >> >>> Also I realize this is an RFC, but we're going to need some caching for reads so >>> we're not having to do a tree search on every IO with the RAID stripe tree in >>> place. >> >> Do we really need to do caching of stripe tree entries? They're read, once the >> corresponding btrfs_file_extent_item is read from disk, which then gets cached >> in the page cache. Every override is cached in the page cache as well. >> >> If we're flushing the cache, we need to re-read both, the file_extent_item and >> the stripe extents. > > Yup ok if we're 1:1 with the file-extents then we don't want the whole tree > striped. > > Since we're 1:1 with the file-extents please make the stripe tree follow the > same convention as the global roots, at least put the load code in the same area > as the csum/fst/extent tree, if your stuff gets merged and turned on before > extnet tree v2 it'll be easier for me to adapt it. Thanks, Sure. I know that there once will be the need to have meta-data in the RST, but then page cache should do the trick for as us well, as it's hanging off the btree inode, isn't it?
On 16.05.22 г. 17:31 ч., Johannes Thumshirn wrote: > Introduce a raid-stripe-tree to record writes in a RAID environment. > > In essence this adds another address translation layer between the logical > and the physical addresses in btrfs and is designed to close two gaps. The > first is the ominous RAID-write-hole we suffer from with RAID5/6 and the > second one is the inability of doing RAID with zoned block devices due to the > constraints we have with REQ_OP_ZONE_APPEND writes. > > Thsi is an RFC/PoC only which just shows how the code will look like for a > zoned RAID1. Its sole purpose is to facilitate design reviews and is not > intended to be merged yet. Or if merged to be used on an actual file-system. > > Johannes Thumshirn (8): > btrfs: add raid stripe tree definitions > btrfs: move btrfs_io_context to volumes.h > btrfs: read raid-stripe-tree from disk > btrfs: add boilerplate code to insert raid extent > btrfs: add code to delete raid extent > btrfs: add code to read raid extent > btrfs: zoned: allow zoned RAID1 > btrfs: add raid stripe tree pretty printer > > fs/btrfs/Makefile | 2 +- > fs/btrfs/ctree.c | 1 + > fs/btrfs/ctree.h | 29 ++++ > fs/btrfs/disk-io.c | 12 ++ > fs/btrfs/extent-tree.c | 9 ++ > fs/btrfs/file.c | 1 - > fs/btrfs/print-tree.c | 21 +++ > fs/btrfs/raid-stripe-tree.c | 251 ++++++++++++++++++++++++++++++++ > fs/btrfs/raid-stripe-tree.h | 39 +++++ > fs/btrfs/volumes.c | 44 +++++- > fs/btrfs/volumes.h | 93 ++++++------ > fs/btrfs/zoned.c | 39 +++++ > include/uapi/linux/btrfs.h | 1 + > include/uapi/linux/btrfs_tree.h | 17 +++ > 14 files changed, 509 insertions(+), 50 deletions(-) > create mode 100644 fs/btrfs/raid-stripe-tree.c > create mode 100644 fs/btrfs/raid-stripe-tree.h > So if we choose to go with raid stripe tree this means we won't need the raid56j code that Qu is working on ? So it's important that these two work streams are synced so we don't duplicate effort, right?
On 2022/5/17 15:23, Nikolay Borisov wrote: > > > On 16.05.22 г. 17:31 ч., Johannes Thumshirn wrote: >> Introduce a raid-stripe-tree to record writes in a RAID environment. >> >> In essence this adds another address translation layer between the >> logical >> and the physical addresses in btrfs and is designed to close two gaps. >> The >> first is the ominous RAID-write-hole we suffer from with RAID5/6 and the >> second one is the inability of doing RAID with zoned block devices due >> to the >> constraints we have with REQ_OP_ZONE_APPEND writes. >> >> Thsi is an RFC/PoC only which just shows how the code will look like >> for a >> zoned RAID1. Its sole purpose is to facilitate design reviews and is not >> intended to be merged yet. Or if merged to be used on an actual >> file-system. >> >> Johannes Thumshirn (8): >> btrfs: add raid stripe tree definitions >> btrfs: move btrfs_io_context to volumes.h >> btrfs: read raid-stripe-tree from disk >> btrfs: add boilerplate code to insert raid extent >> btrfs: add code to delete raid extent >> btrfs: add code to read raid extent >> btrfs: zoned: allow zoned RAID1 >> btrfs: add raid stripe tree pretty printer >> >> fs/btrfs/Makefile | 2 +- >> fs/btrfs/ctree.c | 1 + >> fs/btrfs/ctree.h | 29 ++++ >> fs/btrfs/disk-io.c | 12 ++ >> fs/btrfs/extent-tree.c | 9 ++ >> fs/btrfs/file.c | 1 - >> fs/btrfs/print-tree.c | 21 +++ >> fs/btrfs/raid-stripe-tree.c | 251 ++++++++++++++++++++++++++++++++ >> fs/btrfs/raid-stripe-tree.h | 39 +++++ >> fs/btrfs/volumes.c | 44 +++++- >> fs/btrfs/volumes.h | 93 ++++++------ >> fs/btrfs/zoned.c | 39 +++++ >> include/uapi/linux/btrfs.h | 1 + >> include/uapi/linux/btrfs_tree.h | 17 +++ >> 14 files changed, 509 insertions(+), 50 deletions(-) >> create mode 100644 fs/btrfs/raid-stripe-tree.c >> create mode 100644 fs/btrfs/raid-stripe-tree.h >> > > > So if we choose to go with raid stripe tree this means we won't need the > raid56j code that Qu is working on ? So it's important that these two > work streams are synced so we don't duplicate effort, right? I believe the stripe tree is going to change the definition of RAID56. It's no longer strict RAID56, as it doesn't contain the fixed device rotation, thus it's kinda between RAID4 and RAID5. Personally speaking, I think both features can co-exist, especially the raid56 stripe tree may need extra development and review, since the extra translation layer is a completely different monster when comes to RAID56. Don't get me wrong, I like stripe-tree too, the only problem is it's just too new, thus we may want a backup plan. Thanks, Qu
On 17/05/2022 09:23, Nikolay Borisov wrote: > > > On 16.05.22 г. 17:31 ч., Johannes Thumshirn wrote: >> Introduce a raid-stripe-tree to record writes in a RAID environment. >> >> In essence this adds another address translation layer between the logical >> and the physical addresses in btrfs and is designed to close two gaps. The >> first is the ominous RAID-write-hole we suffer from with RAID5/6 and the >> second one is the inability of doing RAID with zoned block devices due to the >> constraints we have with REQ_OP_ZONE_APPEND writes. >> >> Thsi is an RFC/PoC only which just shows how the code will look like for a >> zoned RAID1. Its sole purpose is to facilitate design reviews and is not >> intended to be merged yet. Or if merged to be used on an actual file-system. >> >> Johannes Thumshirn (8): >> btrfs: add raid stripe tree definitions >> btrfs: move btrfs_io_context to volumes.h >> btrfs: read raid-stripe-tree from disk >> btrfs: add boilerplate code to insert raid extent >> btrfs: add code to delete raid extent >> btrfs: add code to read raid extent >> btrfs: zoned: allow zoned RAID1 >> btrfs: add raid stripe tree pretty printer >> >> fs/btrfs/Makefile | 2 +- >> fs/btrfs/ctree.c | 1 + >> fs/btrfs/ctree.h | 29 ++++ >> fs/btrfs/disk-io.c | 12 ++ >> fs/btrfs/extent-tree.c | 9 ++ >> fs/btrfs/file.c | 1 - >> fs/btrfs/print-tree.c | 21 +++ >> fs/btrfs/raid-stripe-tree.c | 251 ++++++++++++++++++++++++++++++++ >> fs/btrfs/raid-stripe-tree.h | 39 +++++ >> fs/btrfs/volumes.c | 44 +++++- >> fs/btrfs/volumes.h | 93 ++++++------ >> fs/btrfs/zoned.c | 39 +++++ >> include/uapi/linux/btrfs.h | 1 + >> include/uapi/linux/btrfs_tree.h | 17 +++ >> 14 files changed, 509 insertions(+), 50 deletions(-) >> create mode 100644 fs/btrfs/raid-stripe-tree.c >> create mode 100644 fs/btrfs/raid-stripe-tree.h >> > > > So if we choose to go with raid stripe tree this means we won't need the > raid56j code that Qu is working on ? So it's important that these two > work streams are synced so we don't duplicate effort, right? > That's the reason for my early RFC here. I think both solutions have benefits and drawbacks. The stripe tree adds complexity, metadata (though at the moment only 16 bytes per drive in the stripe per extent) and another address translation / lookup layer, it adds the benefit of being always able to do CoW and close the write-hole here. Also it can work with zoned devices and the Zone Append write command. The raid56j code will be simpler in the end I suspect, but it still doesn't do full CoW and isn't Zone Append capable. Two factors that can't work on zoned filesystems. And given that capacity drives will likely be more and more zoned drives, even outside of the hyperscale sector, I see this problematic. Both Qu and I are aware of each others patches and I would really like to get the work converged here. The raid56j code for sure is a stop gap solution for the users that already have a raid56 setup and want to get rid of the write hole. Thanks, Johannes
On 17/05/2022 09:32, Qu Wenruo wrote: > > > On 2022/5/17 15:23, Nikolay Borisov wrote: >> >> >> On 16.05.22 г. 17:31 ч., Johannes Thumshirn wrote: >>> Introduce a raid-stripe-tree to record writes in a RAID environment. >>> >>> In essence this adds another address translation layer between the >>> logical >>> and the physical addresses in btrfs and is designed to close two gaps. >>> The >>> first is the ominous RAID-write-hole we suffer from with RAID5/6 and the >>> second one is the inability of doing RAID with zoned block devices due >>> to the >>> constraints we have with REQ_OP_ZONE_APPEND writes. >>> >>> Thsi is an RFC/PoC only which just shows how the code will look like >>> for a >>> zoned RAID1. Its sole purpose is to facilitate design reviews and is not >>> intended to be merged yet. Or if merged to be used on an actual >>> file-system. >>> >>> Johannes Thumshirn (8): >>> btrfs: add raid stripe tree definitions >>> btrfs: move btrfs_io_context to volumes.h >>> btrfs: read raid-stripe-tree from disk >>> btrfs: add boilerplate code to insert raid extent >>> btrfs: add code to delete raid extent >>> btrfs: add code to read raid extent >>> btrfs: zoned: allow zoned RAID1 >>> btrfs: add raid stripe tree pretty printer >>> >>> fs/btrfs/Makefile | 2 +- >>> fs/btrfs/ctree.c | 1 + >>> fs/btrfs/ctree.h | 29 ++++ >>> fs/btrfs/disk-io.c | 12 ++ >>> fs/btrfs/extent-tree.c | 9 ++ >>> fs/btrfs/file.c | 1 - >>> fs/btrfs/print-tree.c | 21 +++ >>> fs/btrfs/raid-stripe-tree.c | 251 ++++++++++++++++++++++++++++++++ >>> fs/btrfs/raid-stripe-tree.h | 39 +++++ >>> fs/btrfs/volumes.c | 44 +++++- >>> fs/btrfs/volumes.h | 93 ++++++------ >>> fs/btrfs/zoned.c | 39 +++++ >>> include/uapi/linux/btrfs.h | 1 + >>> include/uapi/linux/btrfs_tree.h | 17 +++ >>> 14 files changed, 509 insertions(+), 50 deletions(-) >>> create mode 100644 fs/btrfs/raid-stripe-tree.c >>> create mode 100644 fs/btrfs/raid-stripe-tree.h >>> >> >> >> So if we choose to go with raid stripe tree this means we won't need the >> raid56j code that Qu is working on ? So it's important that these two >> work streams are synced so we don't duplicate effort, right? > > I believe the stripe tree is going to change the definition of RAID56. > > It's no longer strict RAID56, as it doesn't contain the fixed device > rotation, thus it's kinda between RAID4 and RAID5. Well I think it can still contain the device rotation. The stripe tree only records the on-disk location of each sub-stripe, after it has been written. The data placement itself doesn't get changed at all. But for this to work, there's still a lot to do. There's also other plans I have. IIUC btrfs raid56 uses all available drives in a raid set, while raid1,10,0 etc permute the drives the data is placed. Which is a way better solution IMHO as it reduces rebuild stress in case we need to do rebuild. Given we have two digit TB drives these days, rebuilds do a lot of IO which can cause more drives failing while rebuilding. > Personally speaking, I think both features can co-exist, especially the > raid56 stripe tree may need extra development and review, since the > extra translation layer is a completely different monster when comes to > RAID56. > > Don't get me wrong, I like stripe-tree too, the only problem is it's > just too new, thus we may want a backup plan. > Exactly, as I already wrote to Nikolay, raid56j is for sure the simpler solution and some users might even prefer it for this reason. Byte, Johannes
On 2022/5/16 22:31, Johannes Thumshirn wrote:
> Introduce a raid-stripe-tree to record writes in a RAID environment.
>
> In essence this adds another address translation layer between the logical
> and the physical addresses in btrfs and is designed to close two gaps. The
> first is the ominous RAID-write-hole we suffer from with RAID5/6 and the
> second one is the inability of doing RAID with zoned block devices due to the
> constraints we have with REQ_OP_ZONE_APPEND writes.

Here I want to discuss something related to RAID56 and RST.

One of my long existing concerns is: P/Q stripes have a higher update
frequency, thus with certain transaction commit/data writeback timing,
wouldn't it cause the device storing P/Q stripes to run out of space before
the data stripe devices?

One example is like this: we have a 3-disk RAID5, with RST and the zoned
allocator (allocated logical bytenr can only go forward):

       0              32K             64K
Disk 1 |                               | (data stripe)
Disk 2 |                               | (data stripe)
Disk 3 |                               | (parity stripe)

Initially, all the zones on those disks are empty, and their write pointers
are all at the beginning of the zone. (all data)

Then we write 0~4K in the range, and writeback happens immediately (can
be DIO or sync).

We need to write the 0~4K back to disk 1, and update P for that vertical
stripe, right? So we get:

       0              32K             64K
Disk 1 |X                              | (data stripe)
Disk 2 |                               | (data stripe)
Disk 3 |X                              | (parity stripe)

Then we write into the 4~8K range, and sync immediately.

If we go CoW for the P (we have to anyway), what we get is:

       0              32K             64K
Disk 1 |X                              | (data stripe)
Disk 2 |X                              | (data stripe)
Disk 3 |XX                             | (parity stripe)

So now you can see disk 3 (the zone handling parity) has its write pointer
moved 8K forward, but both data stripe zones only have their write pointers
moved 4K forward.

If we go forward like this, always 4K write and sync, we will hit the
following case eventually:

       0              32K             64K
Disk 1 |XXXXXXXXXXXXXXX                | (data stripe)
Disk 2 |XXXXXXXXXXXXXXX                | (data stripe)
Disk 3 |XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX| (parity stripe)

The extent allocator should still think we have 64K of free space to write,
as we have only really written 64K.

But the zone for the parity stripe is already exhausted.

How could we handle such a case? As RAID0/1 shouldn't have such a problem
at all, the imbalance is purely caused by the fact that CoWing P/Q causes a
higher write frequency.

Thanks,
Qu

>
> This is an RFC/PoC only which just shows how the code will look like for a
> zoned RAID1. Its sole purpose is to facilitate design reviews and is not
> intended to be merged yet. Or if merged, to be used on an actual file-system.
>
> Johannes Thumshirn (8):
>   btrfs: add raid stripe tree definitions
>   btrfs: move btrfs_io_context to volumes.h
>   btrfs: read raid-stripe-tree from disk
>   btrfs: add boilerplate code to insert raid extent
>   btrfs: add code to delete raid extent
>   btrfs: add code to read raid extent
>   btrfs: zoned: allow zoned RAID1
>   btrfs: add raid stripe tree pretty printer
>
>  fs/btrfs/Makefile               |   2 +-
>  fs/btrfs/ctree.c                |   1 +
>  fs/btrfs/ctree.h                |  29 ++++
>  fs/btrfs/disk-io.c              |  12 ++
>  fs/btrfs/extent-tree.c          |   9 ++
>  fs/btrfs/file.c                 |   1 -
>  fs/btrfs/print-tree.c           |  21 +++
>  fs/btrfs/raid-stripe-tree.c     | 251 ++++++++++++++++++++++++++++++++
>  fs/btrfs/raid-stripe-tree.h     |  39 +++++
>  fs/btrfs/volumes.c              |  44 +++++-
>  fs/btrfs/volumes.h              |  93 ++++++------
>  fs/btrfs/zoned.c                |  39 +++++
>  include/uapi/linux/btrfs.h      |   1 +
>  include/uapi/linux/btrfs_tree.h |  17 +++
>  14 files changed, 509 insertions(+), 50 deletions(-)
>  create mode 100644 fs/btrfs/raid-stripe-tree.c
>  create mode 100644 fs/btrfs/raid-stripe-tree.h
>
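[Editorial note: the imbalance Qu describes above can be reproduced with a tiny
self-contained simulation. This is only a sketch under the assumptions of his
example: two 64K data zones, one 64K parity zone, 4K synced writes alternating
between the data stripes, and parity CoW'd on every write; none of these names
come from the btrfs code.]

#include <stdio.h>

#define ZONE_SIZE (64 * 1024)
#define IO_SIZE   (4 * 1024)

int main(void)
{
	/* Write pointers for the two data zones and the parity zone. */
	int wp_data[2] = { 0, 0 };
	int wp_parity = 0;
	int data_written = 0;

	/* Alternate 4K writes between the two data stripes, syncing each time.
	 * Every synced write CoWs the parity, so the parity zone advances on
	 * every iteration while each data zone only advances every other one. */
	for (int i = 0; wp_parity + IO_SIZE <= ZONE_SIZE; i++) {
		wp_data[i % 2] += IO_SIZE;
		wp_parity += IO_SIZE;
		data_written += IO_SIZE;
	}

	printf("data written: %dK (data capacity of the full stripe: %dK)\n",
	       data_written / 1024, 2 * ZONE_SIZE / 1024);
	printf("data zone wps: %dK / %dK, parity zone wp: %dK (exhausted)\n",
	       wp_data[0] / 1024, wp_data[1] / 1024, wp_parity / 1024);
	return 0;
}

Running it prints 64K of data written with both data zones only half full while
the parity zone is already at its end, matching the last diagram above.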
On 13.07.22 12:54, Qu Wenruo wrote: > > > On 2022/5/16 22:31, Johannes Thumshirn wrote: >> Introduce a raid-stripe-tree to record writes in a RAID environment. >> >> In essence this adds another address translation layer between the logical >> and the physical addresses in btrfs and is designed to close two gaps. The >> first is the ominous RAID-write-hole we suffer from with RAID5/6 and the >> second one is the inability of doing RAID with zoned block devices due to the >> constraints we have with REQ_OP_ZONE_APPEND writes. > > Here I want to discuss about something related to RAID56 and RST. > > One of my long existing concern is, P/Q stripes have a higher update > frequency, thus with certain transaction commit/data writeback timing, > wouldn't it cause the device storing P/Q stripes go out of space before > the data stripe devices? P/Q stripes on a dedicated drive would be RAID4, which we don't have. > > One example is like this, we have 3 disks RAID5, with RST and zoned > allocator (allocated logical bytenr can only go forward): > > 0 32K 64K > Disk 1 | | (data stripe) > Disk 2 | | (data stripe) > Disk 3 | | (parity stripe) > > And initially, all the zones in those disks are empty, and their write > pointer are all at the beginning of the zone. (all data) > > Then we write 0~4K in the range, and write back happens immediate (can > be DIO or sync). > > We need to write the 0~4K back to disk 1, and update P for that vertical > stripe, right? So we got: > > 0 32K 64K > Disk 1 |X | (data stripe) > Disk 2 | | (data stripe) > Disk 3 |X | (parity stripe) > > Then we write into 4~8K range, and sync immedately. > > If we go C0W for the P (we have to anyway), so what we got is: > > 0 32K 64K > Disk 1 |X | (data stripe) > Disk 2 |X | (data stripe) > Disk 3 |XX | (parity stripe) > > So now, you can see disk3 (the zone handling parity) has its writer > pointer moved 8K forward, but both data stripe zone only has its writer > pointer moved 4K forward. > > If we go forward like this, always 4K write and sync, we will hit the > following case eventually: > > 0 32K 64K > Disk 1 |XXXXXXXXXXXXXXX | (data stripe) > Disk 2 |XXXXXXXXXXXXXXX | (data stripe) > Disk 3 |XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX| (parity stripe) > > The extent allocator should still think we have 64K free space to write, > as we only really have written 64K. > > But the zone for parity stripe is already exhausted. > > How could we handle such case? > As RAID0/1 shouldn't have such problem at all, the imbalance is purely > caused by the fact that CoWing P/Q will cause higher write frequency. > Then the a new zone for the parity stripe has to be allocated, and the old one gets reclaimed. That's nothing new. Of cause there's some gotchas in the extent allocator and the active zone management we need to consider, but over all I do not see where the blocker is here.
On 2022/7/13 19:43, Johannes Thumshirn wrote: > On 13.07.22 12:54, Qu Wenruo wrote: >> >> >> On 2022/5/16 22:31, Johannes Thumshirn wrote: >>> Introduce a raid-stripe-tree to record writes in a RAID environment. >>> >>> In essence this adds another address translation layer between the logical >>> and the physical addresses in btrfs and is designed to close two gaps. The >>> first is the ominous RAID-write-hole we suffer from with RAID5/6 and the >>> second one is the inability of doing RAID with zoned block devices due to the >>> constraints we have with REQ_OP_ZONE_APPEND writes. >> >> Here I want to discuss about something related to RAID56 and RST. >> >> One of my long existing concern is, P/Q stripes have a higher update >> frequency, thus with certain transaction commit/data writeback timing, >> wouldn't it cause the device storing P/Q stripes go out of space before >> the data stripe devices? > > P/Q stripes on a dedicated drive would be RAID4, which we don't have. I'm just using one block group as an example. Sure, the next bg can definitely go somewhere else. But inside one bg, we are still using one zone for the bg, right? > >> >> One example is like this, we have 3 disks RAID5, with RST and zoned >> allocator (allocated logical bytenr can only go forward): >> >> 0 32K 64K >> Disk 1 | | (data stripe) >> Disk 2 | | (data stripe) >> Disk 3 | | (parity stripe) >> >> And initially, all the zones in those disks are empty, and their write >> pointer are all at the beginning of the zone. (all data) >> >> Then we write 0~4K in the range, and write back happens immediate (can >> be DIO or sync). >> >> We need to write the 0~4K back to disk 1, and update P for that vertical >> stripe, right? So we got: >> >> 0 32K 64K >> Disk 1 |X | (data stripe) >> Disk 2 | | (data stripe) >> Disk 3 |X | (parity stripe) >> >> Then we write into 4~8K range, and sync immedately. >> >> If we go C0W for the P (we have to anyway), so what we got is: >> >> 0 32K 64K >> Disk 1 |X | (data stripe) >> Disk 2 |X | (data stripe) >> Disk 3 |XX | (parity stripe) >> >> So now, you can see disk3 (the zone handling parity) has its writer >> pointer moved 8K forward, but both data stripe zone only has its writer >> pointer moved 4K forward. >> >> If we go forward like this, always 4K write and sync, we will hit the >> following case eventually: >> >> 0 32K 64K >> Disk 1 |XXXXXXXXXXXXXXX | (data stripe) >> Disk 2 |XXXXXXXXXXXXXXX | (data stripe) >> Disk 3 |XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX| (parity stripe) >> >> The extent allocator should still think we have 64K free space to write, >> as we only really have written 64K. >> >> But the zone for parity stripe is already exhausted. >> >> How could we handle such case? >> As RAID0/1 shouldn't have such problem at all, the imbalance is purely >> caused by the fact that CoWing P/Q will cause higher write frequency. >> > > Then the a new zone for the parity stripe has to be allocated, and the old one > gets reclaimed. That's nothing new. Of cause there's some gotchas in the extent > allocator and the active zone management we need to consider, but over all I do > not see where the blocker is here. The problem is, we can not reclaim the existing full parity zone yet. We still have parity for the above 32K in that zone. So that zone can not be reclaimed, until both data stripe zoned are claimed. This means, we can have a case, that all data stripes are in the above cases, and need twice the amount of parity zones. 
And in that case, I'm not sure if our chunk allocator can handle it properly, but at least our free space estimation is not accurate. Thanks, Qu
On 13.07.22 14:01, Qu Wenruo wrote:
>
> On 2022/7/13 19:43, Johannes Thumshirn wrote:
>> On 13.07.22 12:54, Qu Wenruo wrote:
>>>
>>> On 2022/5/16 22:31, Johannes Thumshirn wrote:
>>>> Introduce a raid-stripe-tree to record writes in a RAID environment.
>>>>
>>>> In essence this adds another address translation layer between the logical
>>>> and the physical addresses in btrfs and is designed to close two gaps. The
>>>> first is the ominous RAID-write-hole we suffer from with RAID5/6 and the
>>>> second one is the inability of doing RAID with zoned block devices due to the
>>>> constraints we have with REQ_OP_ZONE_APPEND writes.
>>>
>>> Here I want to discuss something related to RAID56 and RST.
>>>
>>> One of my long existing concerns is: P/Q stripes have a higher update
>>> frequency, thus with certain transaction commit/data writeback timing,
>>> wouldn't it cause the device storing P/Q stripes to run out of space before
>>> the data stripe devices?
>>
>> P/Q stripes on a dedicated drive would be RAID4, which we don't have.
>
> I'm just using one block group as an example.
>
> Sure, the next bg can definitely go somewhere else.
>
> But inside one bg, we are still using one zone for the bg, right?

Ok maybe I'm not understanding the code in volumes.c correctly, but
doesn't __btrfs_map_block() calculate a rotation per stripe-set?

I'm looking at this code:

	/* Build raid_map */
	if (map->type & BTRFS_BLOCK_GROUP_RAID56_MASK && need_raid_map &&
	    (need_full_stripe(op) || mirror_num > 1)) {
		u64 tmp;
		unsigned rot;

		/* Work out the disk rotation on this stripe-set */
		div_u64_rem(stripe_nr, num_stripes, &rot);

		/* Fill in the logical address of each stripe */
		tmp = stripe_nr * data_stripes;
		for (i = 0; i < data_stripes; i++)
			bioc->raid_map[(i + rot) % num_stripes] =
				em->start + (tmp + i) * map->stripe_len;

		bioc->raid_map[(i + rot) % map->num_stripes] = RAID5_P_STRIPE;
		if (map->type & BTRFS_BLOCK_GROUP_RAID6)
			bioc->raid_map[(i + rot + 1) % num_stripes] =
				RAID6_Q_STRIPE;

		sort_parity_stripes(bioc, num_stripes);
	}

So then in your example we have something like this:

Write of 4k D1:

       0              32K             64K
Disk 1 |D1                             |
Disk 2 |                               |
Disk 3 |P1                             |

Write of 4k D2; the new parity is P2, the old P1 parity is obsolete:

       0              32K             64K
Disk 1 |D1                             |
Disk 2 |P2                             |
Disk 3 |P1D2                           |

Write of new 4k D1 with P3:

       0              32K             64K
Disk 1 |D1P3                           |
Disk 2 |P2D1                           |
Disk 3 |P1D2                           |

and so on.
On 2022/7/13 20:42, Johannes Thumshirn wrote: > On 13.07.22 14:01, Qu Wenruo wrote: >> >> >> On 2022/7/13 19:43, Johannes Thumshirn wrote: >>> On 13.07.22 12:54, Qu Wenruo wrote: >>>> >>>> >>>> On 2022/5/16 22:31, Johannes Thumshirn wrote: >>>>> Introduce a raid-stripe-tree to record writes in a RAID environment. >>>>> >>>>> In essence this adds another address translation layer between the logical >>>>> and the physical addresses in btrfs and is designed to close two gaps. The >>>>> first is the ominous RAID-write-hole we suffer from with RAID5/6 and the >>>>> second one is the inability of doing RAID with zoned block devices due to the >>>>> constraints we have with REQ_OP_ZONE_APPEND writes. >>>> >>>> Here I want to discuss about something related to RAID56 and RST. >>>> >>>> One of my long existing concern is, P/Q stripes have a higher update >>>> frequency, thus with certain transaction commit/data writeback timing, >>>> wouldn't it cause the device storing P/Q stripes go out of space before >>>> the data stripe devices? >>> >>> P/Q stripes on a dedicated drive would be RAID4, which we don't have. >> >> I'm just using one block group as an example. >> >> Sure, the next bg can definitely go somewhere else. >> >> But inside one bg, we are still using one zone for the bg, right? > > Ok maybe I'm not understanding the code in volumes.c correctly, but > doesn't __btrfs_map_block() calculate a rotation per stripe-set? > > I'm looking at this code: > > /* Build raid_map */ > if (map->type & BTRFS_BLOCK_GROUP_RAID56_MASK && need_raid_map && > (need_full_stripe(op) || mirror_num > 1)) { > u64 tmp; > unsigned rot; > > /* Work out the disk rotation on this stripe-set */ > div_u64_rem(stripe_nr, num_stripes, &rot); > > /* Fill in the logical address of each stripe */ > tmp = stripe_nr * data_stripes; > for (i = 0; i < data_stripes; i++) > bioc->raid_map[(i + rot) % num_stripes] = > em->start + (tmp + i) * map->stripe_len; > > bioc->raid_map[(i + rot) % map->num_stripes] = RAID5_P_STRIPE; > if (map->type & BTRFS_BLOCK_GROUP_RAID6) > bioc->raid_map[(i + rot + 1) % num_stripes] = > RAID6_Q_STRIPE; > > sort_parity_stripes(bioc, num_stripes); > } That's per full-stripe. AKA, the rotation only kicks in after a full stripe. In my example, we're inside one full stripe, no rotation, until next full stripe. > > > So then in your example we have something like this: > > Write of 4k D1: > > 0 32K 64K > Disk 1 |D1 | > Disk 2 | | > Disk 3 |P1 | > > > Write of 4k D2, the new parity is P2 the old P1 parity is obsolete > > 0 32K 64K > Disk 1 |D1 | > Disk 2 |P2 | > Disk 3 |P1D2 | > > Write of new 4k D1 with P3 > > 0 32K 64K > Disk 1 |D1P3 | > Disk 2 |P2D1 | > Disk 3 |P1D2 | > > and so on. So, not the case, at least not in the full stripe. Thanks, Qu
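[Editorial note: Qu's point that the rotation only advances per full stripe can
be shown with a tiny standalone sketch in plain userspace C. It only mimics the
div_u64_rem()-based rotation quoted above; variable names are mine and none of
this is code from volumes.c.]

#include <stdio.h>

int main(void)
{
	const int num_stripes = 3;   /* 2 data + 1 parity, as in the example */
	const int data_stripes = num_stripes - 1;

	/* stripe_nr here is the full-stripe index.  The rotation in
	 * __btrfs_map_block() is derived from it, so every logical address
	 * inside the same full stripe maps parity to the same disk; only the
	 * next full stripe rotates. */
	for (int stripe_nr = 0; stripe_nr < 4; stripe_nr++) {
		int rot = stripe_nr % num_stripes;   /* div_u64_rem() equivalent */
		int parity_disk = (data_stripes + rot) % num_stripes;
		printf("full stripe %d: rotation %d, parity on disk %d\n",
		       stripe_nr, rot, parity_disk);
	}
	return 0;
}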
On 13.07.22 15:47, Qu Wenruo wrote: > > > On 2022/7/13 20:42, Johannes Thumshirn wrote: >> On 13.07.22 14:01, Qu Wenruo wrote: >>> >>> >>> On 2022/7/13 19:43, Johannes Thumshirn wrote: >>>> On 13.07.22 12:54, Qu Wenruo wrote: >>>>> >>>>> >>>>> On 2022/5/16 22:31, Johannes Thumshirn wrote: >>>>>> Introduce a raid-stripe-tree to record writes in a RAID environment. >>>>>> >>>>>> In essence this adds another address translation layer between the logical >>>>>> and the physical addresses in btrfs and is designed to close two gaps. The >>>>>> first is the ominous RAID-write-hole we suffer from with RAID5/6 and the >>>>>> second one is the inability of doing RAID with zoned block devices due to the >>>>>> constraints we have with REQ_OP_ZONE_APPEND writes. >>>>> >>>>> Here I want to discuss about something related to RAID56 and RST. >>>>> >>>>> One of my long existing concern is, P/Q stripes have a higher update >>>>> frequency, thus with certain transaction commit/data writeback timing, >>>>> wouldn't it cause the device storing P/Q stripes go out of space before >>>>> the data stripe devices? >>>> >>>> P/Q stripes on a dedicated drive would be RAID4, which we don't have. >>> >>> I'm just using one block group as an example. >>> >>> Sure, the next bg can definitely go somewhere else. >>> >>> But inside one bg, we are still using one zone for the bg, right? >> >> Ok maybe I'm not understanding the code in volumes.c correctly, but >> doesn't __btrfs_map_block() calculate a rotation per stripe-set? >> >> I'm looking at this code: >> >> /* Build raid_map */ >> if (map->type & BTRFS_BLOCK_GROUP_RAID56_MASK && need_raid_map && >> (need_full_stripe(op) || mirror_num > 1)) { >> u64 tmp; >> unsigned rot; >> >> /* Work out the disk rotation on this stripe-set */ >> div_u64_rem(stripe_nr, num_stripes, &rot); >> >> /* Fill in the logical address of each stripe */ >> tmp = stripe_nr * data_stripes; >> for (i = 0; i < data_stripes; i++) >> bioc->raid_map[(i + rot) % num_stripes] = >> em->start + (tmp + i) * map->stripe_len; >> >> bioc->raid_map[(i + rot) % map->num_stripes] = RAID5_P_STRIPE; >> if (map->type & BTRFS_BLOCK_GROUP_RAID6) >> bioc->raid_map[(i + rot + 1) % num_stripes] = >> RAID6_Q_STRIPE; >> >> sort_parity_stripes(bioc, num_stripes); >> } > > That's per full-stripe. AKA, the rotation only kicks in after a full stripe. > > In my example, we're inside one full stripe, no rotation, until next > full stripe. > Ah ok, my apologies. For sub-stripe size writes My idea was to 0-pad up to stripe size. Then we can do full CoW of stripes. If we have an older generation of a stripe, we can just override it on regular btrfs. On zoned btrfs this just accounts for more zone_unusable bytes and waits for the GC to kick in.
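[Editorial note: a minimal sketch of the zero-padding idea mentioned above, as a
hypothetical userspace helper rather than code from the series: round a
sub-stripe write up to the full-stripe boundary and fill the tail with zeros so
the whole stripe, data plus parity, can be written out CoW in one go. The 64K
per-device stripe element is assumed.]

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define STRIPE_LEN (64 * 1024ULL)   /* per-device stripe element */

/*
 * Pad a sub-stripe write up to a full stripe.  nr_data is the number of
 * data stripes in the full stripe.  Returns a newly allocated buffer of
 * *full_stripe_len bytes; the caller frees it.
 */
static void *pad_to_full_stripe(const void *data, uint64_t len,
				int nr_data, uint64_t *full_stripe_len)
{
	uint64_t full = (uint64_t)nr_data * STRIPE_LEN;
	void *buf;

	if (len > full)
		return NULL;            /* caller must split larger writes */

	buf = calloc(1, full);          /* tail beyond 'len' stays zero */
	if (!buf)
		return NULL;

	memcpy(buf, data, len);
	*full_stripe_len = full;
	return buf;
}

int main(void)
{
	char data[4096] = { 0x42 };
	uint64_t out_len;

	/* For the 3-disk RAID5 example: 4K of data padded to a 128K full stripe. */
	void *full = pad_to_full_stripe(data, sizeof(data), 2, &out_len);
	free(full);
	return 0;
}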
On Wed, 13 Jul 2022 14:01:32 +0000 Johannes Thumshirn <Johannes.Thumshirn@wdc.com> wrote: > On 13.07.22 15:47, Qu Wenruo wrote: > > > > > > On 2022/7/13 20:42, Johannes Thumshirn wrote: > >> On 13.07.22 14:01, Qu Wenruo wrote: > >>> > >>> > >>> On 2022/7/13 19:43, Johannes Thumshirn wrote: > >>>> On 13.07.22 12:54, Qu Wenruo wrote: > >>>>> > >>>>> > >>>>> On 2022/5/16 22:31, Johannes Thumshirn wrote: > >>>>>> Introduce a raid-stripe-tree to record writes in a RAID environment. > >>>>>> > >>>>>> In essence this adds another address translation layer between the logical > >>>>>> and the physical addresses in btrfs and is designed to close two gaps. The > >>>>>> first is the ominous RAID-write-hole we suffer from with RAID5/6 and the > >>>>>> second one is the inability of doing RAID with zoned block devices due to the > >>>>>> constraints we have with REQ_OP_ZONE_APPEND writes. > >>>>> > >>>>> Here I want to discuss about something related to RAID56 and RST. > >>>>> > >>>>> One of my long existing concern is, P/Q stripes have a higher update > >>>>> frequency, thus with certain transaction commit/data writeback timing, > >>>>> wouldn't it cause the device storing P/Q stripes go out of space before > >>>>> the data stripe devices? > >>>> > >>>> P/Q stripes on a dedicated drive would be RAID4, which we don't have. > >>> > >>> I'm just using one block group as an example. > >>> > >>> Sure, the next bg can definitely go somewhere else. > >>> > >>> But inside one bg, we are still using one zone for the bg, right? > >> > >> Ok maybe I'm not understanding the code in volumes.c correctly, but > >> doesn't __btrfs_map_block() calculate a rotation per stripe-set? > >> > >> I'm looking at this code: > >> > >> /* Build raid_map */ > >> if (map->type & BTRFS_BLOCK_GROUP_RAID56_MASK && need_raid_map && > >> (need_full_stripe(op) || mirror_num > 1)) { > >> u64 tmp; > >> unsigned rot; > >> > >> /* Work out the disk rotation on this stripe-set */ > >> div_u64_rem(stripe_nr, num_stripes, &rot); > >> > >> /* Fill in the logical address of each stripe */ > >> tmp = stripe_nr * data_stripes; > >> for (i = 0; i < data_stripes; i++) > >> bioc->raid_map[(i + rot) % num_stripes] = > >> em->start + (tmp + i) * map->stripe_len; > >> > >> bioc->raid_map[(i + rot) % map->num_stripes] = RAID5_P_STRIPE; > >> if (map->type & BTRFS_BLOCK_GROUP_RAID6) > >> bioc->raid_map[(i + rot + 1) % num_stripes] = > >> RAID6_Q_STRIPE; > >> > >> sort_parity_stripes(bioc, num_stripes); > >> } > > > > That's per full-stripe. AKA, the rotation only kicks in after a full stripe. > > > > In my example, we're inside one full stripe, no rotation, until next > > full stripe. > > > > > Ah ok, my apologies. For sub-stripe size writes My idea was to 0-pad up to > stripe size. Then we can do full CoW of stripes. If we have an older generation > of a stripe, we can just override it on regular btrfs. On zoned btrfs this > just accounts for more zone_unusable bytes and waits for the GC to kick in. > Have you considered variable stripe size? I believe ZFS does this. Should be easy for raid5 since it's just xor, not sure for raid6. PS: ZFS seems to do variable-_width_ stripes https://pthree.org/2012/12/05/zfs-administration-part-ii-raidz/ Regards, Lukas Straub --
On 13.07.22 17:25, Lukas Straub wrote:
>
> Have you considered variable stripe size? I believe ZFS does this.
> Should be easy for raid5 since it's just xor, not sure for raid6.
>
> PS: ZFS seems to do variable-_width_ stripes
> https://pthree.org/2012/12/05/zfs-administration-part-ii-raidz/

I did, and coincidentally we have been talking about it just 5 minutes ago,
and both David and Chris aren't very fond of the idea.
On 2022/7/13 22:01, Johannes Thumshirn wrote:
> On 13.07.22 15:47, Qu Wenruo wrote:
>
> Ah ok, my apologies. For sub-stripe size writes my idea was to 0-pad up to
> stripe size. Then we can do full CoW of stripes. If we have an older generation
> of a stripe, we can just override it on regular btrfs. On zoned btrfs this
> just accounts for more zone_unusable bytes and waits for the GC to kick in.
>

Sorry, I guess you still didn't get my point here.

What I'm talking about is how many bytes you can really write into a
full stripe when CoWing P/Q stripes.

[TL;DR]
If we CoW P/Q, for the worst cases (always 4K write and sync), the space
efficiency is no better than RAID1.

For a lot of write orderings, we can only write 64K (STRIPE_LEN) no matter what.

!NOTE! All the following examples use an 8KiB sector size, to make the
graphs shorter.

[CASE 1 CURRENT WRITE ORDER, NO PADDING]

        0                               64K
Disk 1  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | (Data stripe)
Disk 2  | 8 | 9 | a | b | c | d | e | f | (Data stripe)
Disk 3  | P | P | P | P | P | P | P | P | (Parity stripe)

For zoned RST, we can only write 8 sectors, then Disk 3 exhausts its zone,
as every time we write a sector in a data stripe, we have to write a P.

Total written bytes:    64K
Expected written bytes: 128K (nr_data * 64K)
Efficiency:             1 / nr_data. The worst.

[CASE 2 CURRENT WRITE ORDER, PADDING]
No difference than case 1, just when we have finished sector 7, all
zones are exhausted.

Total written bytes:    64K
Expected written bytes: 128K (nr_data * 64K)
Efficiency:             1 / nr_data.

[CASE 3 FULLY UNORDERED, NO PADDING]
This should have the best efficiency, but no better than RAID1.

        0                               64K
Disk 1  | 0 | P | 3 | P | 6 | P | 9 | P |
Disk 2  | P | 2 | P | 5 | P | 8 | P | b |
Disk 3  | 1 | P | 4 | P | 7 | P | a | P |

Total written bytes:    96K
Expected written bytes: 128K (nr_data * 64K)
Efficiency:             1 / 2

This can not even beat RAID1/RAID10, but causes way more metadata just
for the RST.

Whatever the case, we can no longer ensure we can write (nr_data * 64K)
bytes of data into a full stripe.

And for the worst cases, it can be way worse than RAID1. I don't really
think it's any good for our extent allocator or for space efficiency
(that's exactly why users choose to go RAID56).

[ROOT CAUSE]
If we just check how many writes we really need to submit to each device,
it should be obvious:

When the data stripe in disk 1 is filled:

        0                               64K
Disk 1  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 64K written
Disk 2  |   |   |   |   |   |   |   |   | 0 written
Disk 3  | P | P | P | P | P | P | P | P | 64K written

When the data stripe in disk 2 is filled:

        0                               64K
Disk 1  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 64K written
Disk 2  | 8 | 9 | a | b | c | d | e | f | 64K written
Disk 3  | P'| P'| P'| P'| P'| P'| P'| P'| 128K written

For RAID56 partial writes, the total write is always 2 * data written.
Thus for zoned devices, since they can not do any overwrite, their worst
case space efficiency can never exceed RAID1.

Thus I have repeated, time and time again, my concern about this problem
for RST.

Thanks,
Qu
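[Editorial note: the worst-case accounting above condenses into a few lines. The
sketch below only restates the arithmetic for several hypothetical data-stripe
counts, assuming the 64K per-device stripe element; it is not btrfs code.]

#include <stdio.h>

int main(void)
{
	const double stripe_len = 64;   /* KiB per device stripe element */

	/* Worst case from CASE 1/2 above: every data sector write also CoWs a
	 * parity sector, so the parity zone is exhausted after stripe_len KiB
	 * of data, while a full stripe was expected to hold nr_data times
	 * that much. */
	for (int nr_data = 2; nr_data <= 10; nr_data += 2) {
		double written  = stripe_len;
		double expected = nr_data * stripe_len;
		printf("nr_data=%2d: %3.0fK written of %4.0fK expected -> efficiency 1/%d\n",
		       nr_data, written, expected, nr_data);
	}
	return 0;
}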
On 14.07.22 03:08, Qu Wenruo wrote:
> [CASE 2 CURRENT WRITE ORDER, PADDING]
> No difference than case 1, just when we have finished sector 7, all
> zones are exhausted.
>
> Total written bytes:    64K
> Expected written bytes: 128K (nr_data * 64K)
> Efficiency:             1 / nr_data.
>

I'm sorry but I have to disagree.
If we're writing less than 64k, everything beyond these 64k will get filled up with 0:

       0                               64K
Disk 1 | D1| 0 | 0 | 0 | 0 | 0 | 0 | 0 | (Data stripe)
Disk 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | (Data stripe)
Disk 3 | P | P | P | P | P | P | P | P | (Parity stripe)

So the next write (the CoW) will then be:

       64k                             128K
Disk 1 | D1| 0 | 0 | 0 | 0 | 0 | 0 | 0 | (Data stripe)
Disk 2 | D2| 0 | 0 | 0 | 0 | 0 | 0 | 0 | (Data stripe)
Disk 3 | P'| P'| P'| P'| P'| P'| P'| P'| (Parity stripe)

For zoned we can play this game zone_size/stripe_size times, which on a typical
SMR HDD would be:

256M/64k = 4096 times until you fill up a zone.

I.e. if you do stupid things you get stupid results. C'est la vie.
On 2022/7/14 15:08, Johannes Thumshirn wrote: > On 14.07.22 03:08, Qu Wenruo wrote:> [CASE 2 CURRENT WRITE ORDER, PADDING> No difference than case 1, just when we have finished sector 7, all > zones are exhausted.>> Total written bytes: 64K> Expected written bytes: 128K (nr_data * 64K)> Efficiency: 1 / nr_data.> > I'm sorry but I have to disagree. > If we're writing less than 64k, everything beyond these 64k will get filled up with 0 > > 0 64K > Disk 1 | D1| 0 | 0 | 0 | 0 | 0 | 0 | 0 | (Data stripe) > Disk 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | (Data stripe) > Disk 3 | P | P | P | P | P | P | P | P | (Parity stripe) > > So the next write (the CoW) will then be: > > 64k 128K > Disk 1 | D1| 0 | 0 | 0 | 0 | 0 | 0 | 0 | (Data stripe) > Disk 2 | D2| 0 | 0 | 0 | 0 | 0 | 0 | 0 | (Data stripe) > Disk 3 | P'| P'| P'| P'| P'| P'| P'| P'| (Parity stripe) Nope, currently full stripe write should still go into disk1, not disk 2. Sorry I did use a bad example from the very beginning. In that case, what we should have is: 0 64K Disk 1 | D1| D2| 0 | 0 | 0 | 0 | 0 | 0 | (Data stripe) Disk 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | (Data stripe) Disk 3 | P | P | 0 | 0 | 0 | 0 | 0 | 0 | (Parity stripe) In that case, Parity should still needs two blocks. And when Disk 1 get filled up, we have no way to write into Disk 2. > > For zoned we can play this game zone_size/stripe_size times, which on a typical > SMR HDD would be: > > 126M/64k = 4096 times until you fill up a zone. No difference. You have extra zone to use, but the result is, the space efficiency will not be better than RAID1 for the worst case. > > I.e. if you do stupid things you get stupid results. C'est la vie. > You still didn't answer the space efficient problem. RAID56 really rely on overwrite on its P/Q stripes. The total write amount is really twice the data writes, that's something you can not avoid. Thanks, Qu
On 14.07.22 09:32, Qu Wenruo wrote: > > > On 2022/7/14 15:08, Johannes Thumshirn wrote: >> On 14.07.22 03:08, Qu Wenruo wrote:> [CASE 2 CURRENT WRITE ORDER, PADDING> No difference than case 1, just when we have finished sector 7, all > zones are exhausted.>> Total written bytes: 64K> Expected written bytes: 128K (nr_data * 64K)> Efficiency: 1 / nr_data.> >> I'm sorry but I have to disagree. >> If we're writing less than 64k, everything beyond these 64k will get filled up with 0 >> >> 0 64K >> Disk 1 | D1| 0 | 0 | 0 | 0 | 0 | 0 | 0 | (Data stripe) >> Disk 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | (Data stripe) >> Disk 3 | P | P | P | P | P | P | P | P | (Parity stripe) >> >> So the next write (the CoW) will then be: >> >> 64k 128K >> Disk 1 | D1| 0 | 0 | 0 | 0 | 0 | 0 | 0 | (Data stripe) >> Disk 2 | D2| 0 | 0 | 0 | 0 | 0 | 0 | 0 | (Data stripe) >> Disk 3 | P'| P'| P'| P'| P'| P'| P'| P'| (Parity stripe) > > Nope, currently full stripe write should still go into disk1, not disk 2. > Sorry I did use a bad example from the very beginning. > > In that case, what we should have is: > > 0 64K > Disk 1 | D1| D2| 0 | 0 | 0 | 0 | 0 | 0 | (Data stripe) > Disk 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | (Data stripe) > Disk 3 | P | P | 0 | 0 | 0 | 0 | 0 | 0 | (Parity stripe) > > In that case, Parity should still needs two blocks. > > And when Disk 1 get filled up, we have no way to write into Disk 2. > >> >> For zoned we can play this game zone_size/stripe_size times, which on a typical >> SMR HDD would be: >> >> 126M/64k = 4096 times until you fill up a zone. > > No difference. > > You have extra zone to use, but the result is, the space efficiency will > not be better than RAID1 for the worst case. > >> >> I.e. if you do stupid things you get stupid results. C'est la vie. >> > > You still didn't answer the space efficient problem. > > RAID56 really rely on overwrite on its P/Q stripes. Nope, btrfs raid56 does this. Another implementation could for instance buffer each stripe in an NVRAM (like described in [1]), or like Chris suggested in a RAID1 area on the drives, or doing variable stripe length like ZFS' RAID-Z, and so on. > The total write amount is really twice the data writes, that's something > you can not avoid. > Again if you're doing sub-stripe size writes, you're asking stupid things and then there's no reason to not give the user stupid answers. If a user is concerned about the write or space amplicfication of sub-stripe writes on RAID56 he/she really needs to rethink the architecture. [1] S. K. Mishra and P. Mohapatra, "Performance study of RAID-5 disk arrays with data and parity cache," Proceedings of the 1996 ICPP Workshop on Challenges for Parallel Processing, 1996, pp. 222-229 vol.1, doi: 10.1109/ICPP.1996.537164.
On 2022/7/14 15:46, Johannes Thumshirn wrote: > On 14.07.22 09:32, Qu Wenruo wrote: >> >> >> On 2022/7/14 15:08, Johannes Thumshirn wrote: >>> On 14.07.22 03:08, Qu Wenruo wrote:> [CASE 2 CURRENT WRITE ORDER, PADDING> No difference than case 1, just when we have finished sector 7, all > zones are exhausted.>> Total written bytes: 64K> Expected written bytes: 128K (nr_data * 64K)> Efficiency: 1 / nr_data.> >>> I'm sorry but I have to disagree. >>> If we're writing less than 64k, everything beyond these 64k will get filled up with 0 >>> >>> 0 64K >>> Disk 1 | D1| 0 | 0 | 0 | 0 | 0 | 0 | 0 | (Data stripe) >>> Disk 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | (Data stripe) >>> Disk 3 | P | P | P | P | P | P | P | P | (Parity stripe) >>> >>> So the next write (the CoW) will then be: >>> >>> 64k 128K >>> Disk 1 | D1| 0 | 0 | 0 | 0 | 0 | 0 | 0 | (Data stripe) >>> Disk 2 | D2| 0 | 0 | 0 | 0 | 0 | 0 | 0 | (Data stripe) >>> Disk 3 | P'| P'| P'| P'| P'| P'| P'| P'| (Parity stripe) >> >> Nope, currently full stripe write should still go into disk1, not disk 2. >> Sorry I did use a bad example from the very beginning. >> >> In that case, what we should have is: >> >> 0 64K >> Disk 1 | D1| D2| 0 | 0 | 0 | 0 | 0 | 0 | (Data stripe) >> Disk 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | (Data stripe) >> Disk 3 | P | P | 0 | 0 | 0 | 0 | 0 | 0 | (Parity stripe) >> >> In that case, Parity should still needs two blocks. >> >> And when Disk 1 get filled up, we have no way to write into Disk 2. >> >>> >>> For zoned we can play this game zone_size/stripe_size times, which on a typical >>> SMR HDD would be: >>> >>> 126M/64k = 4096 times until you fill up a zone. >> >> No difference. >> >> You have extra zone to use, but the result is, the space efficiency will >> not be better than RAID1 for the worst case. >> >>> >>> I.e. if you do stupid things you get stupid results. C'est la vie. >>> >> >> You still didn't answer the space efficient problem. >> >> RAID56 really rely on overwrite on its P/Q stripes. > > Nope, btrfs raid56 does this. Another implementation could for instance > buffer each stripe in an NVRAM (like described in [1]), or like Chris > suggested in a RAID1 area on the drives, or doing variable stripe length > like ZFS' RAID-Z, and so on. Not only btrfs raid56, but also dm-raid56 also do this. And what you mention is just an variant of journal, delay the write until got a full stripe. > >> The total write amount is really twice the data writes, that's something >> you can not avoid. >> > > Again if you're doing sub-stripe size writes, you're asking stupid things and > then there's no reason to not give the user stupid answers. No, you can not limit what users do. As long as btrfs itself support writes in sectorsize (4K), you can not stop user doing that. In your argument, I can also say, write-intent is a problem of end users, and no need to fix at all. That's definitely not the correct way to do, let user to adapt the limitation? No, just big no. Thanks, Qu > > If a user is concerned about the write or space amplicfication of sub-stripe > writes on RAID56 he/she really needs to rethink the architecture. > > > > [1] > S. K. Mishra and P. Mohapatra, > "Performance study of RAID-5 disk arrays with data and parity cache," > Proceedings of the 1996 ICPP Workshop on Challenges for Parallel Processing, > 1996, pp. 222-229 vol.1, doi: 10.1109/ICPP.1996.537164.
On 14/07/2022 09.46, Johannes Thumshirn wrote:
> On 14.07.22 09:32, Qu Wenruo wrote:
>> [...]
>
> Again if you're doing sub-stripe size writes, you're asking stupid things and
> then there's no reason to not give the user stupid answers.
>

Qu is right: if we consider only full stripe writes, the "raid hole" problem
disappears, because if a "full stripe" is not fully written it is not
referenced either.

Personally I think that the ZFS variable stripe size may be interesting
to evaluate. Moreover, because the BTRFS disk format is quite flexible,
we can store different BGs with different numbers of disks. Let me make an
example: if we have 10 disks, we could allocate:
 1 BG RAID1
 1 BG RAID5, spread over 4 disks only
 1 BG RAID5, spread over 8 disks only
 1 BG RAID5, spread over 10 disks

So if we have short writes, we could put the extents in the RAID1 BG; for longer
writes we could use a RAID5 BG with 4 or 8 or 10 disks depending on the length
of the data.

Yes, this would require a sort of garbage collector to move the data to the
biggest RAID5 BG, but this would avoid (or reduce) the fragmentation which
affects the variable stripe size.

Doing so we don't need any disk format change and it would be backward compatible.

Moreover, if we could put the smaller BGs on the faster disks, we could have
decent tiering....

> If a user is concerned about the write or space amplification of sub-stripe
> writes on RAID56 he/she really needs to rethink the architecture.
>
>
> [1]
> S. K. Mishra and P. Mohapatra,
> "Performance study of RAID-5 disk arrays with data and parity cache,"
> Proceedings of the 1996 ICPP Workshop on Challenges for Parallel Processing,
> 1996, pp. 222-229 vol.1, doi: 10.1109/ICPP.1996.537164.
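[Editorial note: a rough illustration of the selection policy Goffredo proposes
above, as a purely hypothetical userspace sketch. The profile names, thresholds
and helper are all made up; real btrfs block group profiles are encoded as
BTRFS_BLOCK_GROUP_* flags and no such allocator policy exists in the code.]

#include <stdio.h>
#include <stdint.h>

enum bg_profile { BG_RAID1, BG_RAID5_4DISKS, BG_RAID5_8DISKS, BG_RAID5_10DISKS };

#define STRIPE_LEN (64 * 1024ULL)   /* per-device stripe element */

/*
 * Pick the widest RAID5 block group the write can fill with (close to)
 * full stripes; anything smaller falls back to RAID1.
 */
static enum bg_profile pick_bg(uint64_t write_len)
{
	if (write_len >= 9 * STRIPE_LEN)   /* fills a 10-disk stripe (9 data + 1 parity) */
		return BG_RAID5_10DISKS;
	if (write_len >= 7 * STRIPE_LEN)   /* fills an 8-disk stripe */
		return BG_RAID5_8DISKS;
	if (write_len >= 3 * STRIPE_LEN)   /* fills a 4-disk stripe */
		return BG_RAID5_4DISKS;
	return BG_RAID1;                   /* short writes go to RAID1 */
}

int main(void)
{
	const char *names[] = { "RAID1", "RAID5/4", "RAID5/8", "RAID5/10" };
	uint64_t sizes[] = { 16 * 1024, 256 * 1024, 512 * 1024, 1024 * 1024 };

	for (int i = 0; i < 4; i++)
		printf("%8llu bytes -> %s BG\n",
		       (unsigned long long)sizes[i], names[pick_bg(sizes[i])]);
	return 0;
}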
As a user of RAID6 here, let me jump in because I think this suggestion is actually a very good compromise. With stripes written only once, we completely eliminate any possible write-hole, and even without any changes on the current disk layout and allocation, there shouldn't be much wasted space (in my case, I have a 12-disk RAID6, so each full stripe holds 640kb, and discounting single-sector writes that should go into metadata space, any reasonable write should fill that buffer in a few seconds). The additional suggestion of using smaller stripe widths in case there isn't enough data to fill a whole stripe would make it very easy to reclaim the wasted space by rebalancing with a stripe count filter, which can be easily automated and run very frequently. On-disk format also wouldn't change and be fully usable by older kernels, and it should "only" require changes on the allocator to implement. On Fri, Jul 15, 2022 at 2:58 PM Goffredo Baroncelli <kreijack@libero.it> wrote: > > On 14/07/2022 09.46, Johannes Thumshirn wrote: > > On 14.07.22 09:32, Qu Wenruo wrote: > >>[...] > > > > Again if you're doing sub-stripe size writes, you're asking stupid things and > > then there's no reason to not give the user stupid answers. > > > > Qu is right, if we consider only full stripe write the "raid hole" problem > disappear, because if a "full stripe" is not fully written it is not > referenced either. > > > Personally I think that the ZFS variable stripe size, may be interesting > to evaluate. Moreover, because the BTRFS disk format is quite flexible, > we can store different BG with different number of disks. Let me to make an > example: if we have 10 disks, we could allocate: > 1 BG RAID1 > 1 BG RAID5, spread over 4 disks only > 1 BG RAID5, spread over 8 disks only > 1 BG RAID5, spread over 10 disks > > So if we have short writes, we could put the extents in the RAID1 BG; for longer > writes we could use a RAID5 BG with 4 or 8 or 10 disks depending by length > of the data. > > Yes this would require a sort of garbage collector to move the data to the biggest > raid5 BG, but this would avoid (or reduce) the fragmentation which affect the > variable stripe size. > > Doing so we don't need any disk format change and it would be backward compatible. > > > Moreover, if we could put the smaller BG in the faster disks, we could have a > decent tiering.... > > > > If a user is concerned about the write or space amplicfication of sub-stripe > > writes on RAID56 he/she really needs to rethink the architecture. > > > > > > > > [1] > > S. K. Mishra and P. Mohapatra, > > "Performance study of RAID-5 disk arrays with data and parity cache," > > Proceedings of the 1996 ICPP Workshop on Challenges for Parallel Processing, > > 1996, pp. 222-229 vol.1, doi: 10.1109/ICPP.1996.537164. > > -- > gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it> > Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5 >
On Fri, Jul 15, 2022 at 1:55 PM Goffredo Baroncelli <kreijack@libero.it> wrote: > > On 14/07/2022 09.46, Johannes Thumshirn wrote: > > On 14.07.22 09:32, Qu Wenruo wrote: > >>[...] > > > > Again if you're doing sub-stripe size writes, you're asking stupid things and > > then there's no reason to not give the user stupid answers. > > > > Qu is right, if we consider only full stripe write the "raid hole" problem > disappear, because if a "full stripe" is not fully written it is not > referenced either. > > > Personally I think that the ZFS variable stripe size, may be interesting > to evaluate. Moreover, because the BTRFS disk format is quite flexible, > we can store different BG with different number of disks. Let me to make an > example: if we have 10 disks, we could allocate: > 1 BG RAID1 > 1 BG RAID5, spread over 4 disks only > 1 BG RAID5, spread over 8 disks only > 1 BG RAID5, spread over 10 disks > > So if we have short writes, we could put the extents in the RAID1 BG; for longer > writes we could use a RAID5 BG with 4 or 8 or 10 disks depending by length > of the data. > > Yes this would require a sort of garbage collector to move the data to the biggest > raid5 BG, but this would avoid (or reduce) the fragmentation which affect the > variable stripe size. > > Doing so we don't need any disk format change and it would be backward compatible. My 2 cents... Regarding the current raid56 support, in order of preference: a. Fix the current bugs, without changing format. Zygo has an extensive list. b. Mostly fix the write hole, also without changing the format, by only doing COW with full stripe writes. Yes you could somehow get corrupt parity still and not know it until degraded operation produces a bad reconstruction of data - but checksum will still catch that. This kind of "unreplicated corruption" is not quite the same thing as the write hole, because it isn't pernicious like the write hole. c. A new de-clustered parity raid56 implementation that is not backwards compatible. Ergo, I think it's best to not break the format twice. Even if a new raid implementation is years off. Metadata centric workloads suck on parity raid anyway. If Btrfs always does full stripe COW won't matter even if the performance is worse because no one should use parity raid for this workload anyway. -- Chris Murphy
On 2022/7/16 03:08, Thiago Ramon wrote: > As a user of RAID6 here, let me jump in because I think this > suggestion is actually a very good compromise. > > With stripes written only once, we completely eliminate any possible > write-hole, and even without any changes on the current disk layout > and allocation, Unfortunately current extent allocator won't understand the requirement at all. Currently the extent allocator although tends to use clustered free space, when it can not find a clustered space, it goes where it can find a free space. No matter if it's a substripe write. Thus to full stripe only write, it's really the old idea about a new extent allocator to avoid sub-stripe writes. Nowadays with the zoned code, I guess it is now more feasible than previous. Now I think it's time to revive the extent allcator idea, and explore the extent allocator based idea, at least it requires no on-disk format change, which even write-intent still needs a on-disk format change (at least needs a compat ro flag) Thanks, Qu > there shouldn't be much wasted space (in my case, I > have a 12-disk RAID6, so each full stripe holds 640kb, and discounting > single-sector writes that should go into metadata space, any > reasonable write should fill that buffer in a few seconds). > > The additional suggestion of using smaller stripe widths in case there > isn't enough data to fill a whole stripe would make it very easy to > reclaim the wasted space by rebalancing with a stripe count filter, > which can be easily automated and run very frequently. > > On-disk format also wouldn't change and be fully usable by older > kernels, and it should "only" require changes on the allocator to > implement. > > On Fri, Jul 15, 2022 at 2:58 PM Goffredo Baroncelli <kreijack@libero.it> wrote: >> >> On 14/07/2022 09.46, Johannes Thumshirn wrote: >>> On 14.07.22 09:32, Qu Wenruo wrote: >>>> [...] >>> >>> Again if you're doing sub-stripe size writes, you're asking stupid things and >>> then there's no reason to not give the user stupid answers. >>> >> >> Qu is right, if we consider only full stripe write the "raid hole" problem >> disappear, because if a "full stripe" is not fully written it is not >> referenced either. >> >> >> Personally I think that the ZFS variable stripe size, may be interesting >> to evaluate. Moreover, because the BTRFS disk format is quite flexible, >> we can store different BG with different number of disks. Let me to make an >> example: if we have 10 disks, we could allocate: >> 1 BG RAID1 >> 1 BG RAID5, spread over 4 disks only >> 1 BG RAID5, spread over 8 disks only >> 1 BG RAID5, spread over 10 disks >> >> So if we have short writes, we could put the extents in the RAID1 BG; for longer >> writes we could use a RAID5 BG with 4 or 8 or 10 disks depending by length >> of the data. >> >> Yes this would require a sort of garbage collector to move the data to the biggest >> raid5 BG, but this would avoid (or reduce) the fragmentation which affect the >> variable stripe size. >> >> Doing so we don't need any disk format change and it would be backward compatible. >> >> >> Moreover, if we could put the smaller BG in the faster disks, we could have a >> decent tiering.... >> >> >>> If a user is concerned about the write or space amplicfication of sub-stripe >>> writes on RAID56 he/she really needs to rethink the architecture. >>> >>> >>> >>> [1] >>> S. K. Mishra and P. 
Mohapatra, >>> "Performance study of RAID-5 disk arrays with data and parity cache," >>> Proceedings of the 1996 ICPP Workshop on Challenges for Parallel Processing, >>> 1996, pp. 222-229 vol.1, doi: 10.1109/ICPP.1996.537164. >> >> -- >> gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it> >> Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5 >>
On 2022/7/16 08:34, Qu Wenruo wrote:
>
> On 2022/7/16 03:08, Thiago Ramon wrote:
>> As a user of RAID6 here, let me jump in because I think this suggestion is actually a very good compromise.
>>
>> With stripes written only once, we completely eliminate any possible write-hole, and even without any changes on the current disk layout and allocation,
>
> Unfortunately current extent allocator won't understand the requirement at all.
>
> Currently the extent allocator although tends to use clustered free space, when it can not find a clustered space, it goes where it can find a free space. No matter if it's a substripe write.
>
> Thus to full stripe only write, it's really the old idea about a new extent allocator to avoid sub-stripe writes.
>
> Nowadays with the zoned code, I guess it is now more feasible than previous.
>
> Now I think it's time to revive the extent allcator idea, and explore the extent allocator based idea, at least it requires no on-disk format change, which even write-intent still needs a on-disk format change (at least needs a compat ro flag)

After more consideration, I am still not confident that the above extent allocator can avoid sub-stripe writes.

Especially for the following ENOSPC case (I'll later try to submit it as a future-proof test case for fstests).

---
mkfs.btrfs -f -m raid1c3 -d raid5 $dev1 $dev2 $dev3
mount $dev1 $mnt
for (( i=0;; i+=2 )); do
	xfs_io -f -c "pwrite 0 64k" $mnt/file.$i &> /dev/null
	if [ $? -ne 0 ]; then
		break
	fi
	xfs_io -f -c "pwrite 0 64k" $mnt/file.$(($i + 1)) &> /dev/null
	if [ $? -ne 0 ]; then
		break
	fi
	sync
done
rm -rf -- $mnt/file.*[02468]
sync
xfs_io -f -c "pwrite 0 4m" $mnt/new_file
---

The core idea of the above script is to fill the fs using 64K extents, then delete half of them in an interleaved pattern.

This will make every full stripe have one data stripe fully utilized, one data stripe free, and its parity utilized.

If you go with an extent allocator that avoids sub-stripe writes, then the last write will fail.

If you use RST with regular devices and COW the P/Q, then the last write will also fail.

I don't care much about performance or latency, but if a newly proposed RAID56 can no longer do something we could do before, then to me it's a regression.

I'm not against RST, but for RST on regular devices, we still need GC and reserved block groups to avoid the above problems.

And that's why I still prefer write-intent; it brings no possible regression.

>
>> there shouldn't be much wasted space (in my case, I have a 12-disk RAID6, so each full stripe holds 640kb, and discounting single-sector writes that should go into metadata space, any reasonable write should fill that buffer in a few seconds).

Nope, the problem is not that simple.

Consider this: you have an application doing a 64K DIO write.

Then, with an allocator prohibiting sub-stripe writes, it will take a full 640K stripe, wasting 90% of the space.

Furthermore, even if some buffered writes get merged into a 640KiB full stripe, if 9 * 64K of data extents in that full stripe are later freed, you can not use that 9 * 64K of space anyway.

That's why zoned devices have GC and reserved zones.

If we go the allocator way, then we also need non-zoned GC and reserved block groups.

Good luck implementing that feature just for RAID56 on non-zoned devices.
Thanks, Qu >> >> The additional suggestion of using smaller stripe widths in case there >> isn't enough data to fill a whole stripe would make it very easy to >> reclaim the wasted space by rebalancing with a stripe count filter, >> which can be easily automated and run very frequently. >> >> On-disk format also wouldn't change and be fully usable by older >> kernels, and it should "only" require changes on the allocator to >> implement. >> >> On Fri, Jul 15, 2022 at 2:58 PM Goffredo Baroncelli >> <kreijack@libero.it> wrote: >>> >>> On 14/07/2022 09.46, Johannes Thumshirn wrote: >>>> On 14.07.22 09:32, Qu Wenruo wrote: >>>>> [...] >>>> >>>> Again if you're doing sub-stripe size writes, you're asking stupid >>>> things and >>>> then there's no reason to not give the user stupid answers. >>>> >>> >>> Qu is right, if we consider only full stripe write the "raid hole" >>> problem >>> disappear, because if a "full stripe" is not fully written it is not >>> referenced either. >>> >>> >>> Personally I think that the ZFS variable stripe size, may be interesting >>> to evaluate. Moreover, because the BTRFS disk format is quite flexible, >>> we can store different BG with different number of disks. Let me to >>> make an >>> example: if we have 10 disks, we could allocate: >>> 1 BG RAID1 >>> 1 BG RAID5, spread over 4 disks only >>> 1 BG RAID5, spread over 8 disks only >>> 1 BG RAID5, spread over 10 disks >>> >>> So if we have short writes, we could put the extents in the RAID1 BG; >>> for longer >>> writes we could use a RAID5 BG with 4 or 8 or 10 disks depending by >>> length >>> of the data. >>> >>> Yes this would require a sort of garbage collector to move the data >>> to the biggest >>> raid5 BG, but this would avoid (or reduce) the fragmentation which >>> affect the >>> variable stripe size. >>> >>> Doing so we don't need any disk format change and it would be >>> backward compatible. >>> >>> >>> Moreover, if we could put the smaller BG in the faster disks, we >>> could have a >>> decent tiering.... >>> >>> >>>> If a user is concerned about the write or space amplicfication of >>>> sub-stripe >>>> writes on RAID56 he/she really needs to rethink the architecture. >>>> >>>> >>>> >>>> [1] >>>> S. K. Mishra and P. Mohapatra, >>>> "Performance study of RAID-5 disk arrays with data and parity cache," >>>> Proceedings of the 1996 ICPP Workshop on Challenges for Parallel >>>> Processing, >>>> 1996, pp. 222-229 vol.1, doi: 10.1109/ICPP.1996.537164. >>> >>> -- >>> gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it> >>> Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5 >>>
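As a quick sanity check of the 90% figure above, assuming the 12-disk RAID6 layout quoted from Thiago (10 data stripes of 64KiB per full stripe):

---
# Space consumed by a single 64KiB O_DIRECT write if the allocator only
# ever hands out whole 640KiB (10 x 64KiB) full stripes.
data_per_stripe=$(( 10 * 64 ))   # KiB of data per full stripe
dio_write=64                     # KiB actually written
echo "wasted: $(( (data_per_stripe - dio_write) * 100 / data_per_stripe ))%"
---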
On Sat, Jul 16, 2022 at 8:12 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote: > > > > On 2022/7/16 08:34, Qu Wenruo wrote: > > > > > > On 2022/7/16 03:08, Thiago Ramon wrote: > >> As a user of RAID6 here, let me jump in because I think this > >> suggestion is actually a very good compromise. > >> > >> With stripes written only once, we completely eliminate any possible > >> write-hole, and even without any changes on the current disk layout > >> and allocation, > > > > Unfortunately current extent allocator won't understand the requirement > > at all. > > > > Currently the extent allocator although tends to use clustered free > > space, when it can not find a clustered space, it goes where it can find > > a free space. No matter if it's a substripe write. > > > > > > Thus to full stripe only write, it's really the old idea about a new > > extent allocator to avoid sub-stripe writes. > > > > Nowadays with the zoned code, I guess it is now more feasible than > > previous. > > > > Now I think it's time to revive the extent allcator idea, and explore > > the extent allocator based idea, at least it requires no on-disk format > > change, which even write-intent still needs a on-disk format change (at > > least needs a compat ro flag) > > After more consideration, I am still not confident of above extent > allocator avoid sub-stripe write. > > Especially for the following ENOSPC case (I'll later try submit it as an > future proof test case for fstests). > > --- > mkfs.btrfs -f -m raid1c3 -d raid5 $dev1 $dev2 $dev3 > mount $dev1 $mnt > for (( i=0;; i+=2 )) do > xfs_io -f -c "pwrite 0 64k" $mnt/file.$i &> /dev/null > if [ $? -ne 0 ]; then > break > fi > xfs_io -f -c "pwrite 0 64k" $mnt/file.$(($i + 1)) &> /dev/null > if [ $? -ne 0 ]; then > break > fi > sync > done > rm -rf -- $mnt/file.*[02468] > sync > xfs_io -f -c "pwrite 0 4m" $mnt/new_file > --- > > The core idea of above script it, fill the fs using 64K extents. > Then delete half of them interleavely. > > This will make all the full stripes to have one data stripe fully > utilize, one free, and all parity utilized. > > If you go extent allocator that avoid sub-stripe write, then the last > write will fail. > > If you RST with regular devices and COWing P/Q, then the last write will > also fail. > > To me, I don't care about performance or latency, but at least, what we > can do before, but now if a new proposed RAID56 can not do, then to me > it's a regression. > > I'm not against RST, but for RST on regular devices, we still need GC > and reserved block groups to avoid above problems. > > And that's why I still prefer write-intent, it brings no possible > regression. While the test does fail as-is, rebalancing will recover all the wasted space. It's a new gotcha for RAID56, but I think it's still preferable than the write-hole, and is proper CoW. Narrowing the stripes to 4k would waste a lot less space overall, but there's probably code around that depends on the current 64k-tall stripes. > > > > >> there shouldn't be much wasted space (in my case, I > >> have a 12-disk RAID6, so each full stripe holds 640kb, and discounting > >> single-sector writes that should go into metadata space, any > >> reasonable write should fill that buffer in a few seconds). > > Nope, the problem is not that simple. > > Consider this, you have an application doing an 64K write DIO. > > Then with allocator prohibiting sub-stripe write, it will take a full > 640K stripe, wasting 90% of your space. 
> > > Furthermore, even if you have some buffered write, merged into an 640KiB > full stripe, but later 9 * 64K of data extents in that full stripe get > freed. > Then you can not use that 9 * 64K space anyway. > > That's why zoned device has GC and reserved zones. > > If we go allocator way, then we also need a non-zoned GC and reserved > block groups. > > Good luck implementing that feature just for RAID56 on non-zoned devices. DIO definitely would be a problem this way. As you mention, a separate zone for high;y modified data would make things a lot easier (maybe a RAID1Cx zone), but that definitely would be a huge change on the way things are handled. Another, easier solution would be disabling DIO altogether for RAID56, and I'd prefer that if that's the cost of having RAID56 finally respecting CoW and stopping modifying data shared with other files. But as you say, it's definitely a regression if we change things this way, and we'd need to hear from other people using RAID56 what they'd prefer. > > Thanks, > Qu > > >> > >> The additional suggestion of using smaller stripe widths in case there > >> isn't enough data to fill a whole stripe would make it very easy to > >> reclaim the wasted space by rebalancing with a stripe count filter, > >> which can be easily automated and run very frequently. > >> > >> On-disk format also wouldn't change and be fully usable by older > >> kernels, and it should "only" require changes on the allocator to > >> implement. > >> > >> On Fri, Jul 15, 2022 at 2:58 PM Goffredo Baroncelli > >> <kreijack@libero.it> wrote: > >>> > >>> On 14/07/2022 09.46, Johannes Thumshirn wrote: > >>>> On 14.07.22 09:32, Qu Wenruo wrote: > >>>>> [...] > >>>> > >>>> Again if you're doing sub-stripe size writes, you're asking stupid > >>>> things and > >>>> then there's no reason to not give the user stupid answers. > >>>> > >>> > >>> Qu is right, if we consider only full stripe write the "raid hole" > >>> problem > >>> disappear, because if a "full stripe" is not fully written it is not > >>> referenced either. > >>> > >>> > >>> Personally I think that the ZFS variable stripe size, may be interesting > >>> to evaluate. Moreover, because the BTRFS disk format is quite flexible, > >>> we can store different BG with different number of disks. Let me to > >>> make an > >>> example: if we have 10 disks, we could allocate: > >>> 1 BG RAID1 > >>> 1 BG RAID5, spread over 4 disks only > >>> 1 BG RAID5, spread over 8 disks only > >>> 1 BG RAID5, spread over 10 disks > >>> > >>> So if we have short writes, we could put the extents in the RAID1 BG; > >>> for longer > >>> writes we could use a RAID5 BG with 4 or 8 or 10 disks depending by > >>> length > >>> of the data. > >>> > >>> Yes this would require a sort of garbage collector to move the data > >>> to the biggest > >>> raid5 BG, but this would avoid (or reduce) the fragmentation which > >>> affect the > >>> variable stripe size. > >>> > >>> Doing so we don't need any disk format change and it would be > >>> backward compatible. > >>> > >>> > >>> Moreover, if we could put the smaller BG in the faster disks, we > >>> could have a > >>> decent tiering.... > >>> > >>> > >>>> If a user is concerned about the write or space amplicfication of > >>>> sub-stripe > >>>> writes on RAID56 he/she really needs to rethink the architecture. > >>>> > >>>> > >>>> > >>>> [1] > >>>> S. K. Mishra and P. 
Mohapatra, > >>>> "Performance study of RAID-5 disk arrays with data and parity cache," > >>>> Proceedings of the 1996 ICPP Workshop on Challenges for Parallel > >>>> Processing, > >>>> 1996, pp. 222-229 vol.1, doi: 10.1109/ICPP.1996.537164. > >>> > >>> -- > >>> gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it> > >>> Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5 > >>>
On 16/07/2022 15.52, Thiago Ramon wrote: >> Good luck implementing that feature just for RAID56 on non-zoned devices. > DIO definitely would be a problem this way. As you mention, a separate > zone for high;y modified data would make things a lot easier (maybe a > RAID1Cx zone), but that definitely would be a huge change on the way > things are handled. When you talk about DIO, do you mean O_DIRECT ? Because this is full reliable even without RAID56... See my email "BUG: BTRFS and O_DIRECT could lead to wrong checksum and wrong data", sent 15/09/2017
On 2022/7/16 21:52, Thiago Ramon wrote: > On Sat, Jul 16, 2022 at 8:12 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote: >> >> >> >> On 2022/7/16 08:34, Qu Wenruo wrote: >>> >>> >>> On 2022/7/16 03:08, Thiago Ramon wrote: >>>> As a user of RAID6 here, let me jump in because I think this >>>> suggestion is actually a very good compromise. >>>> >>>> With stripes written only once, we completely eliminate any possible >>>> write-hole, and even without any changes on the current disk layout >>>> and allocation, >>> >>> Unfortunately current extent allocator won't understand the requirement >>> at all. >>> >>> Currently the extent allocator although tends to use clustered free >>> space, when it can not find a clustered space, it goes where it can find >>> a free space. No matter if it's a substripe write. >>> >>> >>> Thus to full stripe only write, it's really the old idea about a new >>> extent allocator to avoid sub-stripe writes. >>> >>> Nowadays with the zoned code, I guess it is now more feasible than >>> previous. >>> >>> Now I think it's time to revive the extent allcator idea, and explore >>> the extent allocator based idea, at least it requires no on-disk format >>> change, which even write-intent still needs a on-disk format change (at >>> least needs a compat ro flag) >> >> After more consideration, I am still not confident of above extent >> allocator avoid sub-stripe write. >> >> Especially for the following ENOSPC case (I'll later try submit it as an >> future proof test case for fstests). >> >> --- >> mkfs.btrfs -f -m raid1c3 -d raid5 $dev1 $dev2 $dev3 >> mount $dev1 $mnt >> for (( i=0;; i+=2 )) do >> xfs_io -f -c "pwrite 0 64k" $mnt/file.$i &> /dev/null >> if [ $? -ne 0 ]; then >> break >> fi >> xfs_io -f -c "pwrite 0 64k" $mnt/file.$(($i + 1)) &> /dev/null >> if [ $? -ne 0 ]; then >> break >> fi >> sync >> done >> rm -rf -- $mnt/file.*[02468] >> sync >> xfs_io -f -c "pwrite 0 4m" $mnt/new_file >> --- >> >> The core idea of above script it, fill the fs using 64K extents. >> Then delete half of them interleavely. >> >> This will make all the full stripes to have one data stripe fully >> utilize, one free, and all parity utilized. >> >> If you go extent allocator that avoid sub-stripe write, then the last >> write will fail. >> >> If you RST with regular devices and COWing P/Q, then the last write will >> also fail. >> >> To me, I don't care about performance or latency, but at least, what we >> can do before, but now if a new proposed RAID56 can not do, then to me >> it's a regression. >> >> I'm not against RST, but for RST on regular devices, we still need GC >> and reserved block groups to avoid above problems. >> >> And that's why I still prefer write-intent, it brings no possible >> regression. > While the test does fail as-is, rebalancing will recover all the > wasted space. Nope, the fs is already filled, you have no unallocated space to do balance. That's exactly why zoned btrfs have reserved zones to handle such problem for GC. > It's a new gotcha for RAID56, but I think it's still > preferable than the write-hole, and is proper CoW. > Narrowing the stripes to 4k would waste a lot less space overall, but > there's probably code around that depends on the current 64k-tall > stripes. Yes, limiting stripe size to 4K will cause way less wasted space, but the result is still the same for the worst case script, thus still need garbage collecting and reserved space for GC. 
Thanks, Qu > >> >>> >>>> there shouldn't be much wasted space (in my case, I >>>> have a 12-disk RAID6, so each full stripe holds 640kb, and discounting >>>> single-sector writes that should go into metadata space, any >>>> reasonable write should fill that buffer in a few seconds). >> >> Nope, the problem is not that simple. >> >> Consider this, you have an application doing an 64K write DIO. >> >> Then with allocator prohibiting sub-stripe write, it will take a full >> 640K stripe, wasting 90% of your space. >> >> >> Furthermore, even if you have some buffered write, merged into an 640KiB >> full stripe, but later 9 * 64K of data extents in that full stripe get >> freed. >> Then you can not use that 9 * 64K space anyway. >> >> That's why zoned device has GC and reserved zones. >> >> If we go allocator way, then we also need a non-zoned GC and reserved >> block groups. >> >> Good luck implementing that feature just for RAID56 on non-zoned devices. > DIO definitely would be a problem this way. As you mention, a separate > zone for high;y modified data would make things a lot easier (maybe a > RAID1Cx zone), but that definitely would be a huge change on the way > things are handled. > Another, easier solution would be disabling DIO altogether for RAID56, > and I'd prefer that if that's the cost of having RAID56 finally > respecting CoW and stopping modifying data shared with other files. > But as you say, it's definitely a regression if we change things this > way, and we'd need to hear from other people using RAID56 what they'd > prefer. > >> >> Thanks, >> Qu >> >>>> >>>> The additional suggestion of using smaller stripe widths in case there >>>> isn't enough data to fill a whole stripe would make it very easy to >>>> reclaim the wasted space by rebalancing with a stripe count filter, >>>> which can be easily automated and run very frequently. >>>> >>>> On-disk format also wouldn't change and be fully usable by older >>>> kernels, and it should "only" require changes on the allocator to >>>> implement. >>>> >>>> On Fri, Jul 15, 2022 at 2:58 PM Goffredo Baroncelli >>>> <kreijack@libero.it> wrote: >>>>> >>>>> On 14/07/2022 09.46, Johannes Thumshirn wrote: >>>>>> On 14.07.22 09:32, Qu Wenruo wrote: >>>>>>> [...] >>>>>> >>>>>> Again if you're doing sub-stripe size writes, you're asking stupid >>>>>> things and >>>>>> then there's no reason to not give the user stupid answers. >>>>>> >>>>> >>>>> Qu is right, if we consider only full stripe write the "raid hole" >>>>> problem >>>>> disappear, because if a "full stripe" is not fully written it is not >>>>> referenced either. >>>>> >>>>> >>>>> Personally I think that the ZFS variable stripe size, may be interesting >>>>> to evaluate. Moreover, because the BTRFS disk format is quite flexible, >>>>> we can store different BG with different number of disks. Let me to >>>>> make an >>>>> example: if we have 10 disks, we could allocate: >>>>> 1 BG RAID1 >>>>> 1 BG RAID5, spread over 4 disks only >>>>> 1 BG RAID5, spread over 8 disks only >>>>> 1 BG RAID5, spread over 10 disks >>>>> >>>>> So if we have short writes, we could put the extents in the RAID1 BG; >>>>> for longer >>>>> writes we could use a RAID5 BG with 4 or 8 or 10 disks depending by >>>>> length >>>>> of the data. >>>>> >>>>> Yes this would require a sort of garbage collector to move the data >>>>> to the biggest >>>>> raid5 BG, but this would avoid (or reduce) the fragmentation which >>>>> affect the >>>>> variable stripe size. 
>>>>> >>>>> Doing so we don't need any disk format change and it would be >>>>> backward compatible. >>>>> >>>>> >>>>> Moreover, if we could put the smaller BG in the faster disks, we >>>>> could have a >>>>> decent tiering.... >>>>> >>>>> >>>>>> If a user is concerned about the write or space amplicfication of >>>>>> sub-stripe >>>>>> writes on RAID56 he/she really needs to rethink the architecture. >>>>>> >>>>>> >>>>>> >>>>>> [1] >>>>>> S. K. Mishra and P. Mohapatra, >>>>>> "Performance study of RAID-5 disk arrays with data and parity cache," >>>>>> Proceedings of the 1996 ICPP Workshop on Challenges for Parallel >>>>>> Processing, >>>>>> 1996, pp. 222-229 vol.1, doi: 10.1109/ICPP.1996.537164. >>>>> >>>>> -- >>>>> gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it> >>>>> Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5 >>>>>
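Qu's point that "the fs is already filled" rests on the distinction between free space inside existing block groups and unallocated device space that balance can relocate into; a quick way to see the difference on the reproducer's mount point ($mnt is the variable from the script above):

---
# Balance needs unallocated device space to create the relocation target
# chunk; "Free (estimated)" can be large while "Device unallocated" is zero.
btrfs filesystem usage -b "$mnt" | grep -E 'Device unallocated|Free \(estimated\)'
---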
On Sat, Jul 16, 2022 at 9:30 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote: > > > > On 2022/7/16 21:52, Thiago Ramon wrote: > > On Sat, Jul 16, 2022 at 8:12 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote: > >> > >> > >> > >> On 2022/7/16 08:34, Qu Wenruo wrote: > >>> > >>> > >>> On 2022/7/16 03:08, Thiago Ramon wrote: > >>>> As a user of RAID6 here, let me jump in because I think this > >>>> suggestion is actually a very good compromise. > >>>> > >>>> With stripes written only once, we completely eliminate any possible > >>>> write-hole, and even without any changes on the current disk layout > >>>> and allocation, > >>> > >>> Unfortunately current extent allocator won't understand the requirement > >>> at all. > >>> > >>> Currently the extent allocator although tends to use clustered free > >>> space, when it can not find a clustered space, it goes where it can find > >>> a free space. No matter if it's a substripe write. > >>> > >>> > >>> Thus to full stripe only write, it's really the old idea about a new > >>> extent allocator to avoid sub-stripe writes. > >>> > >>> Nowadays with the zoned code, I guess it is now more feasible than > >>> previous. > >>> > >>> Now I think it's time to revive the extent allcator idea, and explore > >>> the extent allocator based idea, at least it requires no on-disk format > >>> change, which even write-intent still needs a on-disk format change (at > >>> least needs a compat ro flag) > >> > >> After more consideration, I am still not confident of above extent > >> allocator avoid sub-stripe write. > >> > >> Especially for the following ENOSPC case (I'll later try submit it as an > >> future proof test case for fstests). > >> > >> --- > >> mkfs.btrfs -f -m raid1c3 -d raid5 $dev1 $dev2 $dev3 > >> mount $dev1 $mnt > >> for (( i=0;; i+=2 )) do > >> xfs_io -f -c "pwrite 0 64k" $mnt/file.$i &> /dev/null > >> if [ $? -ne 0 ]; then > >> break > >> fi > >> xfs_io -f -c "pwrite 0 64k" $mnt/file.$(($i + 1)) &> /dev/null > >> if [ $? -ne 0 ]; then > >> break > >> fi > >> sync > >> done > >> rm -rf -- $mnt/file.*[02468] > >> sync > >> xfs_io -f -c "pwrite 0 4m" $mnt/new_file > >> --- > >> > >> The core idea of above script it, fill the fs using 64K extents. > >> Then delete half of them interleavely. > >> > >> This will make all the full stripes to have one data stripe fully > >> utilize, one free, and all parity utilized. > >> > >> If you go extent allocator that avoid sub-stripe write, then the last > >> write will fail. > >> > >> If you RST with regular devices and COWing P/Q, then the last write will > >> also fail. > >> > >> To me, I don't care about performance or latency, but at least, what we > >> can do before, but now if a new proposed RAID56 can not do, then to me > >> it's a regression. > >> > >> I'm not against RST, but for RST on regular devices, we still need GC > >> and reserved block groups to avoid above problems. > >> > >> And that's why I still prefer write-intent, it brings no possible > >> regression. > > While the test does fail as-is, rebalancing will recover all the > > wasted space. > > Nope, the fs is already filled, you have no unallocated space to do balance. > > That's exactly why zoned btrfs have reserved zones to handle such > problem for GC. Very good point. What would be the implementation difficulty and overall impact of ALWAYS reserving space, for exclusive balance usage, for at least 1 metadata or data block group, whichever is larger? 
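for exclusive balance usage, for at least 1 metadata or data block group, whichever is larger?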
This would obviously create some unusable space on the FS, but I think this would solve the majority of ENOSPC problems with all profiles. Of course an option to disable this would also be needed for advanced usage, but it sounds like a decent default. > > > It's a new gotcha for RAID56, but I think it's still > > preferable than the write-hole, and is proper CoW. > > Narrowing the stripes to 4k would waste a lot less space overall, but > > there's probably code around that depends on the current 64k-tall > > stripes. > > Yes, limiting stripe size to 4K will cause way less wasted space, but > the result is still the same for the worst case script, thus still need > garbage collecting and reserved space for GC. > > Thanks, > Qu > > > > >> > >>> > >>>> there shouldn't be much wasted space (in my case, I > >>>> have a 12-disk RAID6, so each full stripe holds 640kb, and discounting > >>>> single-sector writes that should go into metadata space, any > >>>> reasonable write should fill that buffer in a few seconds). > >> > >> Nope, the problem is not that simple. > >> > >> Consider this, you have an application doing an 64K write DIO. > >> > >> Then with allocator prohibiting sub-stripe write, it will take a full > >> 640K stripe, wasting 90% of your space. > >> > >> > >> Furthermore, even if you have some buffered write, merged into an 640KiB > >> full stripe, but later 9 * 64K of data extents in that full stripe get > >> freed. > >> Then you can not use that 9 * 64K space anyway. > >> > >> That's why zoned device has GC and reserved zones. > >> > >> If we go allocator way, then we also need a non-zoned GC and reserved > >> block groups. > >> > >> Good luck implementing that feature just for RAID56 on non-zoned devices. > > DIO definitely would be a problem this way. As you mention, a separate > > zone for high;y modified data would make things a lot easier (maybe a > > RAID1Cx zone), but that definitely would be a huge change on the way > > things are handled. > > Another, easier solution would be disabling DIO altogether for RAID56, > > and I'd prefer that if that's the cost of having RAID56 finally > > respecting CoW and stopping modifying data shared with other files. > > But as you say, it's definitely a regression if we change things this > > way, and we'd need to hear from other people using RAID56 what they'd > > prefer. > > > >> > >> Thanks, > >> Qu > >> > >>>> > >>>> The additional suggestion of using smaller stripe widths in case there > >>>> isn't enough data to fill a whole stripe would make it very easy to > >>>> reclaim the wasted space by rebalancing with a stripe count filter, > >>>> which can be easily automated and run very frequently. > >>>> > >>>> On-disk format also wouldn't change and be fully usable by older > >>>> kernels, and it should "only" require changes on the allocator to > >>>> implement. > >>>> > >>>> On Fri, Jul 15, 2022 at 2:58 PM Goffredo Baroncelli > >>>> <kreijack@libero.it> wrote: > >>>>> > >>>>> On 14/07/2022 09.46, Johannes Thumshirn wrote: > >>>>>> On 14.07.22 09:32, Qu Wenruo wrote: > >>>>>>> [...] > >>>>>> > >>>>>> Again if you're doing sub-stripe size writes, you're asking stupid > >>>>>> things and > >>>>>> then there's no reason to not give the user stupid answers. > >>>>>> > >>>>> > >>>>> Qu is right, if we consider only full stripe write the "raid hole" > >>>>> problem > >>>>> disappear, because if a "full stripe" is not fully written it is not > >>>>> referenced either. 
> >>>>> > >>>>> > >>>>> Personally I think that the ZFS variable stripe size, may be interesting > >>>>> to evaluate. Moreover, because the BTRFS disk format is quite flexible, > >>>>> we can store different BG with different number of disks. Let me to > >>>>> make an > >>>>> example: if we have 10 disks, we could allocate: > >>>>> 1 BG RAID1 > >>>>> 1 BG RAID5, spread over 4 disks only > >>>>> 1 BG RAID5, spread over 8 disks only > >>>>> 1 BG RAID5, spread over 10 disks > >>>>> > >>>>> So if we have short writes, we could put the extents in the RAID1 BG; > >>>>> for longer > >>>>> writes we could use a RAID5 BG with 4 or 8 or 10 disks depending by > >>>>> length > >>>>> of the data. > >>>>> > >>>>> Yes this would require a sort of garbage collector to move the data > >>>>> to the biggest > >>>>> raid5 BG, but this would avoid (or reduce) the fragmentation which > >>>>> affect the > >>>>> variable stripe size. > >>>>> > >>>>> Doing so we don't need any disk format change and it would be > >>>>> backward compatible. > >>>>> > >>>>> > >>>>> Moreover, if we could put the smaller BG in the faster disks, we > >>>>> could have a > >>>>> decent tiering.... > >>>>> > >>>>> > >>>>>> If a user is concerned about the write or space amplicfication of > >>>>>> sub-stripe > >>>>>> writes on RAID56 he/she really needs to rethink the architecture. > >>>>>> > >>>>>> > >>>>>> > >>>>>> [1] > >>>>>> S. K. Mishra and P. Mohapatra, > >>>>>> "Performance study of RAID-5 disk arrays with data and parity cache," > >>>>>> Proceedings of the 1996 ICPP Workshop on Challenges for Parallel > >>>>>> Processing, > >>>>>> 1996, pp. 222-229 vol.1, doi: 10.1109/ICPP.1996.537164. > >>>>> > >>>>> -- > >>>>> gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it> > >>>>> Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5 > >>>>>
On 16/07/2022 16.26, Goffredo Baroncelli wrote:
> On 16/07/2022 15.52, Thiago Ramon wrote:
>>> Good luck implementing that feature just for RAID56 on non-zoned devices.
>> DIO definitely would be a problem this way. As you mention, a separate zone for high;y modified data would make things a lot easier (maybe a RAID1Cx zone), but that definitely would be a huge change on the way things are handled.
>
> When you talk about DIO, do you mean O_DIRECT ? Because this is full reliable even without RAID56...

Ehmm... I forgot a "not". My last sentence should read: "Because this is NOT fully reliable even without RAID56..."

> See my email
>
> "BUG: BTRFS and O_DIRECT could lead to wrong checksum and wrong data", sent 15/09/2017
On 2022/7/17 23:18, Thiago Ramon wrote: > On Sat, Jul 16, 2022 at 9:30 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote: >> >> >> >> On 2022/7/16 21:52, Thiago Ramon wrote: >>> On Sat, Jul 16, 2022 at 8:12 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote: >>>> >>>> >>>> >>>> On 2022/7/16 08:34, Qu Wenruo wrote: >>>>> >>>>> >>>>> On 2022/7/16 03:08, Thiago Ramon wrote: >>>>>> As a user of RAID6 here, let me jump in because I think this >>>>>> suggestion is actually a very good compromise. >>>>>> >>>>>> With stripes written only once, we completely eliminate any possible >>>>>> write-hole, and even without any changes on the current disk layout >>>>>> and allocation, >>>>> >>>>> Unfortunately current extent allocator won't understand the requirement >>>>> at all. >>>>> >>>>> Currently the extent allocator although tends to use clustered free >>>>> space, when it can not find a clustered space, it goes where it can find >>>>> a free space. No matter if it's a substripe write. >>>>> >>>>> >>>>> Thus to full stripe only write, it's really the old idea about a new >>>>> extent allocator to avoid sub-stripe writes. >>>>> >>>>> Nowadays with the zoned code, I guess it is now more feasible than >>>>> previous. >>>>> >>>>> Now I think it's time to revive the extent allcator idea, and explore >>>>> the extent allocator based idea, at least it requires no on-disk format >>>>> change, which even write-intent still needs a on-disk format change (at >>>>> least needs a compat ro flag) >>>> >>>> After more consideration, I am still not confident of above extent >>>> allocator avoid sub-stripe write. >>>> >>>> Especially for the following ENOSPC case (I'll later try submit it as an >>>> future proof test case for fstests). >>>> >>>> --- >>>> mkfs.btrfs -f -m raid1c3 -d raid5 $dev1 $dev2 $dev3 >>>> mount $dev1 $mnt >>>> for (( i=0;; i+=2 )) do >>>> xfs_io -f -c "pwrite 0 64k" $mnt/file.$i &> /dev/null >>>> if [ $? -ne 0 ]; then >>>> break >>>> fi >>>> xfs_io -f -c "pwrite 0 64k" $mnt/file.$(($i + 1)) &> /dev/null >>>> if [ $? -ne 0 ]; then >>>> break >>>> fi >>>> sync >>>> done >>>> rm -rf -- $mnt/file.*[02468] >>>> sync >>>> xfs_io -f -c "pwrite 0 4m" $mnt/new_file >>>> --- >>>> >>>> The core idea of above script it, fill the fs using 64K extents. >>>> Then delete half of them interleavely. >>>> >>>> This will make all the full stripes to have one data stripe fully >>>> utilize, one free, and all parity utilized. >>>> >>>> If you go extent allocator that avoid sub-stripe write, then the last >>>> write will fail. >>>> >>>> If you RST with regular devices and COWing P/Q, then the last write will >>>> also fail. >>>> >>>> To me, I don't care about performance or latency, but at least, what we >>>> can do before, but now if a new proposed RAID56 can not do, then to me >>>> it's a regression. >>>> >>>> I'm not against RST, but for RST on regular devices, we still need GC >>>> and reserved block groups to avoid above problems. >>>> >>>> And that's why I still prefer write-intent, it brings no possible >>>> regression. >>> While the test does fail as-is, rebalancing will recover all the >>> wasted space. >> >> Nope, the fs is already filled, you have no unallocated space to do balance. >> >> That's exactly why zoned btrfs have reserved zones to handle such >> problem for GC. > > Very good point. What would be the implementation difficulty and > overall impact of ALWAYS reserving space, To me, it's not simple, especially for non-zoned devices, we don't have any existing unit like zones to do reservation. 
And since we support device with uneven size, it can be pretty tricky to calculate what size we should really reserve. At least for now for non-zoned device, we don't have extra reservation of unallocated space, nor the auto reclaim mechanism. We may learn from zoned code, but I'm not confident if we should jump into the rabbit hole at all, especially we already have a write-intent implementation to avoid all the problem at least for non-zoned devices. Another (smaller) problem is latency, if we ran of out space, we need to kick in GC to reclaim space. IIRC for zoned device it's mostly balancing near-empty zones into a new zone, which can definitely introduce latency. Thanks, Qu > for exclusive balance usage, > for at least 1 metadata or data block group, whichever is larger? > This would obviously create some unusable space on the FS, but I think > this would solve the majority of ENOSPC problems with all profiles. Of > course an option to disable this would also be needed for advanced > usage, but it sounds like a decent default. > >> >>> It's a new gotcha for RAID56, but I think it's still >>> preferable than the write-hole, and is proper CoW. >>> Narrowing the stripes to 4k would waste a lot less space overall, but >>> there's probably code around that depends on the current 64k-tall >>> stripes. >> >> Yes, limiting stripe size to 4K will cause way less wasted space, but >> the result is still the same for the worst case script, thus still need >> garbage collecting and reserved space for GC. >> >> Thanks, >> Qu >> >>> >>>> >>>>> >>>>>> there shouldn't be much wasted space (in my case, I >>>>>> have a 12-disk RAID6, so each full stripe holds 640kb, and discounting >>>>>> single-sector writes that should go into metadata space, any >>>>>> reasonable write should fill that buffer in a few seconds). >>>> >>>> Nope, the problem is not that simple. >>>> >>>> Consider this, you have an application doing an 64K write DIO. >>>> >>>> Then with allocator prohibiting sub-stripe write, it will take a full >>>> 640K stripe, wasting 90% of your space. >>>> >>>> >>>> Furthermore, even if you have some buffered write, merged into an 640KiB >>>> full stripe, but later 9 * 64K of data extents in that full stripe get >>>> freed. >>>> Then you can not use that 9 * 64K space anyway. >>>> >>>> That's why zoned device has GC and reserved zones. >>>> >>>> If we go allocator way, then we also need a non-zoned GC and reserved >>>> block groups. >>>> >>>> Good luck implementing that feature just for RAID56 on non-zoned devices. >>> DIO definitely would be a problem this way. As you mention, a separate >>> zone for high;y modified data would make things a lot easier (maybe a >>> RAID1Cx zone), but that definitely would be a huge change on the way >>> things are handled. >>> Another, easier solution would be disabling DIO altogether for RAID56, >>> and I'd prefer that if that's the cost of having RAID56 finally >>> respecting CoW and stopping modifying data shared with other files. >>> But as you say, it's definitely a regression if we change things this >>> way, and we'd need to hear from other people using RAID56 what they'd >>> prefer. >>> >>>> >>>> Thanks, >>>> Qu >>>> >>>>>> >>>>>> The additional suggestion of using smaller stripe widths in case there >>>>>> isn't enough data to fill a whole stripe would make it very easy to >>>>>> reclaim the wasted space by rebalancing with a stripe count filter, >>>>>> which can be easily automated and run very frequently. 
>>>>>> >>>>>> On-disk format also wouldn't change and be fully usable by older >>>>>> kernels, and it should "only" require changes on the allocator to >>>>>> implement. >>>>>> >>>>>> On Fri, Jul 15, 2022 at 2:58 PM Goffredo Baroncelli >>>>>> <kreijack@libero.it> wrote: >>>>>>> >>>>>>> On 14/07/2022 09.46, Johannes Thumshirn wrote: >>>>>>>> On 14.07.22 09:32, Qu Wenruo wrote: >>>>>>>>> [...] >>>>>>>> >>>>>>>> Again if you're doing sub-stripe size writes, you're asking stupid >>>>>>>> things and >>>>>>>> then there's no reason to not give the user stupid answers. >>>>>>>> >>>>>>> >>>>>>> Qu is right, if we consider only full stripe write the "raid hole" >>>>>>> problem >>>>>>> disappear, because if a "full stripe" is not fully written it is not >>>>>>> referenced either. >>>>>>> >>>>>>> >>>>>>> Personally I think that the ZFS variable stripe size, may be interesting >>>>>>> to evaluate. Moreover, because the BTRFS disk format is quite flexible, >>>>>>> we can store different BG with different number of disks. Let me to >>>>>>> make an >>>>>>> example: if we have 10 disks, we could allocate: >>>>>>> 1 BG RAID1 >>>>>>> 1 BG RAID5, spread over 4 disks only >>>>>>> 1 BG RAID5, spread over 8 disks only >>>>>>> 1 BG RAID5, spread over 10 disks >>>>>>> >>>>>>> So if we have short writes, we could put the extents in the RAID1 BG; >>>>>>> for longer >>>>>>> writes we could use a RAID5 BG with 4 or 8 or 10 disks depending by >>>>>>> length >>>>>>> of the data. >>>>>>> >>>>>>> Yes this would require a sort of garbage collector to move the data >>>>>>> to the biggest >>>>>>> raid5 BG, but this would avoid (or reduce) the fragmentation which >>>>>>> affect the >>>>>>> variable stripe size. >>>>>>> >>>>>>> Doing so we don't need any disk format change and it would be >>>>>>> backward compatible. >>>>>>> >>>>>>> >>>>>>> Moreover, if we could put the smaller BG in the faster disks, we >>>>>>> could have a >>>>>>> decent tiering.... >>>>>>> >>>>>>> >>>>>>>> If a user is concerned about the write or space amplicfication of >>>>>>>> sub-stripe >>>>>>>> writes on RAID56 he/she really needs to rethink the architecture. >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> [1] >>>>>>>> S. K. Mishra and P. Mohapatra, >>>>>>>> "Performance study of RAID-5 disk arrays with data and parity cache," >>>>>>>> Proceedings of the 1996 ICPP Workshop on Challenges for Parallel >>>>>>>> Processing, >>>>>>>> 1996, pp. 222-229 vol.1, doi: 10.1109/ICPP.1996.537164. >>>>>>> >>>>>>> -- >>>>>>> gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it> >>>>>>> Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5 >>>>>>>
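On the uneven-device-size point: any reservation scheme would have to reason about where the unallocated space sits, not just how much there is in total; btrfs already exposes the per-device view (filesystem assumed mounted at /mnt):

---
# Per-device allocated vs. unallocated breakdown. With devices of different
# sizes, a RAID56 chunk can only use space that is unallocated on enough
# devices at once, so the total unallocated figure alone is misleading.
btrfs device usage /mnt
---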
On Sat, Jul 16, 2022 at 08:34:30AM +0800, Qu Wenruo wrote: > > > On 2022/7/16 03:08, Thiago Ramon wrote: > > As a user of RAID6 here, let me jump in because I think this > > suggestion is actually a very good compromise. > > > > With stripes written only once, we completely eliminate any possible > > write-hole, and even without any changes on the current disk layout > > and allocation, > > Unfortunately current extent allocator won't understand the requirement > at all. > > Currently the extent allocator although tends to use clustered free > space, when it can not find a clustered space, it goes where it can find > a free space. No matter if it's a substripe write. > Thus to full stripe only write, it's really the old idea about a new > extent allocator to avoid sub-stripe writes. > Nowadays with the zoned code, I guess it is now more feasible than previous. It's certainly easier, but the gotcha at the bottom of the pile for stripe-level GC on raid5 for btrfs is that raid5 stripe boundaries don't match btrfs extent boundaries. If I write some extents of various sizes: Extents: [4k][24k][64k][160k][--512k--][200k][100k] Stripes: [---384k--------------][---384k-][---384k---] If that 64K extent is freed, and I later write new data to it, then in theory I have to CoW the 4k, 24k, 160k extents, and _parts of_ the 512k extent, or GC needs to be able to split extents (with an explosion of fragmentation as all the existing extents are sliced up in fractions-of-384k sized pieces). Both options involve significant IO amplification (reads and writes) at write time, with the worst case being something like: Extents: ...--][128M][-8K-][128M][... Stripes: ...][384k][384k][384k][... where there's a 384k raid stripe that contains parts of two 128M extents and a 4K free space, and CoW does 256MB of IO on data blocks alone. All of the above seems like an insane path we don't want to follow. The main points of WIL (and as far as I can tell, also of RST) are: - it's a tree that translates logical bytenrs to new physical bytenrs so you can do CoW (RST) or journalling (WIL) on raid56 stripes - it's persistent on disk in mirrored (non-parity) metadata, so the write hole is closed and no committed data is lost on crash (note we don't need to ever make parity metadata work because mirrored metadata will suffice, so this solution does not have to be adapted to work for metadata) - the tree is used to perform CoW on the raid stripe level, not the btrfs extent level, i.e. everything this tree does is invisible to btrfs extent, csum, and subvol trees. It basically behaves like a writeback cache for RMW stripe updates. On non-zoned devices, write intent log could write a complete stripe in a new location, record the new location in a WIL tree, commit, overwrite the stripe in the original location, delete the new location from the WIL tree, and commit again. This effectively makes raid5 stripes CoW instead of RMW, and closes the write hole. There's no need to modify any other btrfs trees, which is good because relocation is expensive compared to the overhead of overwriting the unmodified blocks in a stripe for non-zoned devices. Full-stripe writes don't require any of this, so they go straight to the disk and leave no trace in WIL. A writeback thread can handle flushing WIL entries back to original stripe locations in the background, and a small amount of extra space will be used while that thread catches up to writes from previous transactions. 
There's no need to do anything new with the allocator for this, because everything is hidden in the btrfs raid5/6 profile layer and the WIL tree, so the existing clustered allocator is fine (though a RMW-avoiding allocator would be faster). On zoned devices, none of this seems necessary or useful, and some of it is actively harmful. We can't overwrite data in place, so we get no benefit from a shortcut that might allow us to. Once a stripe is written, it's stuck in a read-only state until every extent that references the stripe is deleted (by deleting the containing block group). There's no requirement to copy a stripe at any time, since any new writes could simply get allocated to extents in a new raid stripe. When we are reclaiming space from a zone in GC, we want to copy only the data that remains in existing extents, not the leftover unused blocks in the raid stripes that contain the extents, so we simply perform exactly that copy in reclaim. For zoned device reclaim we _want_ all of the btrfs trees (csum, extent, and subvol) to have extent-level visibility so that we can avoid copying data from stripes that contain extents we didn't modify or that were later deleted. ISTM that zoned devices naturally fix the btrfs raid5/6 write hole issues without any special effort because their limitations wrt overwrites simply don't allow the write hole to be implemented. > Now I think it's time to revive the extent allcator idea, and explore > the extent allocator based idea, at least it requires no on-disk format > change, which even write-intent still needs a on-disk format change (at > least needs a compat ro flag) This is the attractive feature about getting the allocator disciplined so that RMW isn't needed any more. It can reuse all the work of the zoned implementation, except with the ability to allocate a full raid stripe in any block group, not just the few that are opened for appending. This would introduce a new requirement for existing raid5 filesystems that several BGs are reserved for reclaim; however, this is not a particularly onerous requirement since several BGs have to be reserved for metadata expansion to avoid ENOSPC already, and there's no automation for this in the filesystem. Also raid5 filesystems are typically larger than average and can afford a few hundred spare GB. btrfs-cleaner only has to be taught to not delete every single empty block group, but leave a few spares allocated for GC. > Thanks, > Qu > > > there shouldn't be much wasted space (in my case, I > > have a 12-disk RAID6, so each full stripe holds 640kb, and discounting > > single-sector writes that should go into metadata space, any > > reasonable write should fill that buffer in a few seconds). > > > > The additional suggestion of using smaller stripe widths in case there > > isn't enough data to fill a whole stripe would make it very easy to > > reclaim the wasted space by rebalancing with a stripe count filter, > > which can be easily automated and run very frequently. > > > > On-disk format also wouldn't change and be fully usable by older > > kernels, and it should "only" require changes on the allocator to > > implement. > > > > On Fri, Jul 15, 2022 at 2:58 PM Goffredo Baroncelli <kreijack@libero.it> wrote: > > > > > > On 14/07/2022 09.46, Johannes Thumshirn wrote: > > > > On 14.07.22 09:32, Qu Wenruo wrote: > > > > > [...] > > > > > > > > Again if you're doing sub-stripe size writes, you're asking stupid things and > > > > then there's no reason to not give the user stupid answers. 
> > > > > > > > > > Qu is right, if we consider only full stripe write the "raid hole" problem > > > disappear, because if a "full stripe" is not fully written it is not > > > referenced either. > > > > > > > > > Personally I think that the ZFS variable stripe size, may be interesting > > > to evaluate. Moreover, because the BTRFS disk format is quite flexible, > > > we can store different BG with different number of disks. Let me to make an > > > example: if we have 10 disks, we could allocate: > > > 1 BG RAID1 > > > 1 BG RAID5, spread over 4 disks only > > > 1 BG RAID5, spread over 8 disks only > > > 1 BG RAID5, spread over 10 disks > > > > > > So if we have short writes, we could put the extents in the RAID1 BG; for longer > > > writes we could use a RAID5 BG with 4 or 8 or 10 disks depending by length > > > of the data. > > > > > > Yes this would require a sort of garbage collector to move the data to the biggest > > > raid5 BG, but this would avoid (or reduce) the fragmentation which affect the > > > variable stripe size. > > > > > > Doing so we don't need any disk format change and it would be backward compatible. > > > > > > > > > Moreover, if we could put the smaller BG in the faster disks, we could have a > > > decent tiering.... > > > > > > > > > > If a user is concerned about the write or space amplicfication of sub-stripe > > > > writes on RAID56 he/she really needs to rethink the architecture. > > > > > > > > > > > > > > > > [1] > > > > S. K. Mishra and P. Mohapatra, > > > > "Performance study of RAID-5 disk arrays with data and parity cache," > > > > Proceedings of the 1996 ICPP Workshop on Challenges for Parallel Processing, > > > > 1996, pp. 222-229 vol.1, doi: 10.1109/ICPP.1996.537164. > > > > > > -- > > > gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it> > > > Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5 > > >
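Going back to the extent/stripe layout at the top of this message: the mismatch is plain arithmetic, sketched below for the example's 384KiB of data per full stripe (e.g. six 64KiB data stripes); the offsets are the ones implied by the example layout.

---
# The freed 64KiB extent from the example starts at 4K + 24K = 28KiB into the
# block group, so it sits entirely inside full stripe 0; rewriting that
# stripe means CoWing (or garbage-collecting) the 4K, 24K and 160K extents
# plus the head of the 512K extent that share it.
stripe_data=$(( 384 * 1024 ))
extent_start=$(( 28 * 1024 ))
extent_len=$(( 64 * 1024 ))
first=$(( extent_start / stripe_data ))
last=$(( (extent_start + extent_len - 1) / stripe_data ))
echo "extent occupies full stripes $first..$last of its block group"
---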
On 2022/7/18 07:00, Zygo Blaxell wrote: > On Sat, Jul 16, 2022 at 08:34:30AM +0800, Qu Wenruo wrote: >> >> >> On 2022/7/16 03:08, Thiago Ramon wrote: >>> As a user of RAID6 here, let me jump in because I think this >>> suggestion is actually a very good compromise. >>> >>> With stripes written only once, we completely eliminate any possible >>> write-hole, and even without any changes on the current disk layout >>> and allocation, >> >> Unfortunately current extent allocator won't understand the requirement >> at all. >> >> Currently the extent allocator although tends to use clustered free >> space, when it can not find a clustered space, it goes where it can find >> a free space. No matter if it's a substripe write. > >> Thus to full stripe only write, it's really the old idea about a new >> extent allocator to avoid sub-stripe writes. > >> Nowadays with the zoned code, I guess it is now more feasible than previous. > > It's certainly easier, but the gotcha at the bottom of the pile for > stripe-level GC on raid5 for btrfs is that raid5 stripe boundaries > don't match btrfs extent boundaries. If I write some extents of various > sizes: > > Extents: [4k][24k][64k][160k][--512k--][200k][100k] > Stripes: [---384k--------------][---384k-][---384k---] That won't be a problem for RST, at least for RST on zoned device. On RST with zoned device, we split extents to match the stripe boundary, thus no problem. (At least that's my educated guess). But this will become a problem for RST on non-zoned device. > > If that 64K extent is freed, and I later write new data to it, then > in theory I have to CoW the 4k, 24k, 160k extents, and _parts of_ > the 512k extent, or GC needs to be able to split extents (with an > explosion of fragmentation as all the existing extents are sliced up > in fractions-of-384k sized pieces). Both options involve significant > IO amplification (reads and writes) at write time, with the worst case > being something like: > > Extents: ...--][128M][-8K-][128M][... > Stripes: ...][384k][384k][384k][... > > where there's a 384k raid stripe that contains parts of two 128M extents > and a 4K free space, and CoW does 256MB of IO on data blocks alone. > All of the above seems like an insane path we don't want to follow. > > The main points of WIL (and as far as I can tell, also of RST) are: > > - it's a tree that translates logical bytenrs to new physical > bytenrs so you can do CoW (RST) or journalling (WIL) on raid56 > stripes > > - it's persistent on disk in mirrored (non-parity) metadata, > so the write hole is closed and no committed data is lost on > crash (note we don't need to ever make parity metadata work > because mirrored metadata will suffice, so this solution does > not have to be adapted to work for metadata) > > - the tree is used to perform CoW on the raid stripe level, > not the btrfs extent level, i.e. everything this tree does is > invisible to btrfs extent, csum, and subvol trees. I'm afraid we can not really do pure RAID level COW, without really touching the extent layer. At least for now, zoned code has already changed extent allocator to always go forward to compensate for the zoned limitations. I know this is not a good idea for layer separation we want, but unfortunately zoned support really depends on that. And if we really go the RST with transparent COW without touching extent allocator, the complexity would go sky rocket. > > It basically behaves like a writeback cache for RMW stripe updates. 
> > On non-zoned devices, write intent log could write a complete stripe in > a new location, record the new location in a WIL tree, commit, overwrite > the stripe in the original location, delete the new location from the > WIL tree, and commit again. This effectively makes raid5 stripes CoW > instead of RMW, and closes the write hole. There's no need to modify > any other btrfs trees, which is good because relocation is expensive > compared to the overhead of overwriting the unmodified blocks in a stripe > for non-zoned devices. Yep, that's why I'm so strongly pushing for write-intent bitmaps. It really is the least astonishment way to go, but at the cost of no zoned support. But considering how bad reputation nowadays SMR devices receive, I doubt normal end users would even consider zoned device for RAID56. > Full-stripe writes don't require any of this, > so they go straight to the disk and leave no trace in WIL. Exactly the optimization I want to go, although not in current write-intent code, it only needs less than 10 lines to implement. > A writeback > thread can handle flushing WIL entries back to original stripe locations > in the background, and a small amount of extra space will be used while > that thread catches up to writes from previous transactions. I don't think we need a dedicated flushing thread. Currently for write-intent bitmaps, it's too small to really need to wait. (and dm-bitmap also goes this way, if the bitmap overflows, we just wait and retry). For full journal, we can just wait for the endio function to free up some space before we need to add new journal. > There's no > need to do anything new with the allocator for this, because everything > is hidden in the btrfs raid5/6 profile layer and the WIL tree, so the > existing clustered allocator is fine (though a RMW-avoiding allocator > would be faster). > > On zoned devices, none of this seems necessary or useful, and some of > it is actively harmful. We can't overwrite data in place, so we get no > benefit from a shortcut that might allow us to. Once a stripe is written, > it's stuck in a read-only state until every extent that references the > stripe is deleted (by deleting the containing block group). There's no > requirement to copy a stripe at any time, since any new writes could > simply get allocated to extents in a new raid stripe. When we are > reclaiming space from a zone in GC, we want to copy only the data that > remains in existing extents, not the leftover unused blocks in the raid > stripes that contain the extents, so we simply perform exactly that copy > in reclaim. For zoned device reclaim we _want_ all of the btrfs trees > (csum, extent, and subvol) to have extent-level visibility so that we can > avoid copying data from stripes that contain extents we didn't modify > or that were later deleted. > > ISTM that zoned devices naturally fix the btrfs raid5/6 write hole issues > without any special effort because their limitations wrt overwrites > simply don't allow the write hole to be implemented. Not that simple, but mostly correct. For zoned device, they need RST anyway for data RAID support (not only RAID56). And since their RST is also protected by transaction, we won't have the sub-stripe overwrite problem, as if we lost power, we still see the old RST tree. But we will face the new challenges like GC and reserved zones just to handle the script I mentioned above. 
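[Editor's note: Zygo's six-step sequence for non-zoned devices could be sketched as below. All of the function names are hypothetical stand-ins that only log what would happen; none of this is an existing btrfs API.]

#include <stdio.h>
#include <stdint.h>

/* Stand-ins for the real operations; they only log what would happen. */
static void write_full_stripe(uint64_t dst)         { printf("  write updated stripe at %llu\n", (unsigned long long)dst); }
static void wil_record(uint64_t orig, uint64_t tmp) { printf("  WIL: %llu -> %llu\n", (unsigned long long)orig, (unsigned long long)tmp); }
static void wil_delete(uint64_t orig)               { printf("  WIL: drop entry for %llu\n", (unsigned long long)orig); }
static void commit_transaction(void)                { printf("  commit\n"); }

/*
 * CoW-style RMW of one raid56 stripe through a write-intent log,
 * restating the sequence described in the mail above.
 */
static void rmw_stripe_via_wil(uint64_t stripe, uint64_t scratch)
{
	write_full_stripe(scratch);  /* 1. write the full stripe to a new location */
	wil_record(stripe, scratch); /* 2. record the new location in the WIL tree */
	commit_transaction();        /* 3. make the redirect durable               */
	write_full_stripe(stripe);   /* 4. overwrite the original location         */
	wil_delete(stripe);          /* 5. delete the temporary entry              */
	commit_transaction();        /* 6. done; scratch space is reusable         */
}

int main(void)
{
	rmw_stripe_via_wil(384 * 1024, 10 * 384 * 1024);
	return 0;
}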
> >> Now I think it's time to revive the extent allcator idea, and explore >> the extent allocator based idea, at least it requires no on-disk format >> change, which even write-intent still needs a on-disk format change (at >> least needs a compat ro flag) > > This is the attractive feature about getting the allocator disciplined > so that RMW isn't needed any more. It can reuse all the work of the > zoned implementation, except with the ability to allocate a full raid > stripe in any block group, not just the few that are opened for appending. > > This would introduce a new requirement for existing raid5 filesystems that > several BGs are reserved for reclaim; however, this is not a particularly > onerous requirement since several BGs have to be reserved for metadata > expansion to avoid ENOSPC already, Sorry, we don't reserved BGs at all for non-zoned devices. In fact, zoned devices can do reserved zones because they have the unit of zones. If for non-zoned devices, how to calculate the space? Using the similar code like chunk allocator? How to handle added devices or uneven devices? That hardware zones really solves a lot of problems for reserved space. In short, I'm not against RST + cow parity, but not comfort at all for the following things: - Zoned GC mechanism - No support for non-zoned devices at all We still need GC, as long as we go Parity COW, even we can overwrite. Unless there are some prototypes that can pass the script I mentioned above, I'm not in. Thanks, Qu > and there's no automation for this > in the filesystem. Also raid5 filesystems are typically larger than > average and can afford a few hundred spare GB. btrfs-cleaner only has > to be taught to not delete every single empty block group, but leave a > few spares allocated for GC. > >> Thanks, >> Qu >> >>> there shouldn't be much wasted space (in my case, I >>> have a 12-disk RAID6, so each full stripe holds 640kb, and discounting >>> single-sector writes that should go into metadata space, any >>> reasonable write should fill that buffer in a few seconds). >>> >>> The additional suggestion of using smaller stripe widths in case there >>> isn't enough data to fill a whole stripe would make it very easy to >>> reclaim the wasted space by rebalancing with a stripe count filter, >>> which can be easily automated and run very frequently. >>> >>> On-disk format also wouldn't change and be fully usable by older >>> kernels, and it should "only" require changes on the allocator to >>> implement. >>> >>> On Fri, Jul 15, 2022 at 2:58 PM Goffredo Baroncelli <kreijack@libero.it> wrote: >>>> >>>> On 14/07/2022 09.46, Johannes Thumshirn wrote: >>>>> On 14.07.22 09:32, Qu Wenruo wrote: >>>>>> [...] >>>>> >>>>> Again if you're doing sub-stripe size writes, you're asking stupid things and >>>>> then there's no reason to not give the user stupid answers. >>>>> >>>> >>>> Qu is right, if we consider only full stripe write the "raid hole" problem >>>> disappear, because if a "full stripe" is not fully written it is not >>>> referenced either. >>>> >>>> >>>> Personally I think that the ZFS variable stripe size, may be interesting >>>> to evaluate. Moreover, because the BTRFS disk format is quite flexible, >>>> we can store different BG with different number of disks. 
Let me to make an >>>> example: if we have 10 disks, we could allocate: >>>> 1 BG RAID1 >>>> 1 BG RAID5, spread over 4 disks only >>>> 1 BG RAID5, spread over 8 disks only >>>> 1 BG RAID5, spread over 10 disks >>>> >>>> So if we have short writes, we could put the extents in the RAID1 BG; for longer >>>> writes we could use a RAID5 BG with 4 or 8 or 10 disks depending by length >>>> of the data. >>>> >>>> Yes this would require a sort of garbage collector to move the data to the biggest >>>> raid5 BG, but this would avoid (or reduce) the fragmentation which affect the >>>> variable stripe size. >>>> >>>> Doing so we don't need any disk format change and it would be backward compatible. >>>> >>>> >>>> Moreover, if we could put the smaller BG in the faster disks, we could have a >>>> decent tiering.... >>>> >>>> >>>>> If a user is concerned about the write or space amplicfication of sub-stripe >>>>> writes on RAID56 he/she really needs to rethink the architecture. >>>>> >>>>> >>>>> >>>>> [1] >>>>> S. K. Mishra and P. Mohapatra, >>>>> "Performance study of RAID-5 disk arrays with data and parity cache," >>>>> Proceedings of the 1996 ICPP Workshop on Challenges for Parallel Processing, >>>>> 1996, pp. 222-229 vol.1, doi: 10.1109/ICPP.1996.537164. >>>> >>>> -- >>>> gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it> >>>> Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5 >>>>
On 15.07.22 19:54, Goffredo Baroncelli wrote:
> On 14/07/2022 09.46, Johannes Thumshirn wrote:
>> On 14.07.22 09:32, Qu Wenruo wrote:
>>> [...]
>>
>> Again if you're doing sub-stripe size writes, you're asking stupid things and
>> then there's no reason to not give the user stupid answers.
>>
>
> Qu is right, if we consider only full stripe write the "raid hole" problem
> disappear, because if a "full stripe" is not fully written it is not
> referenced either.

It's not that there will be a new write hole, it's just that there is sub-optimal
space consumption until we can either re-write or garbage collect the blocks.

>
> Personally I think that the ZFS variable stripe size, may be interesting

But then we would need extra meta-data to describe the size of each stripe.

> to evaluate. Moreover, because the BTRFS disk format is quite flexible,
> we can store different BG with different number of disks. Let me to make an
> example: if we have 10 disks, we could allocate:
> 1 BG RAID1
> 1 BG RAID5, spread over 4 disks only
> 1 BG RAID5, spread over 8 disks only
> 1 BG RAID5, spread over 10 disks
>
> So if we have short writes, we could put the extents in the RAID1 BG; for longer
> writes we could use a RAID5 BG with 4 or 8 or 10 disks depending by length
> of the data.
>
> Yes this would require a sort of garbage collector to move the data to the biggest
> raid5 BG, but this would avoid (or reduce) the fragmentation which affect the
> variable stripe size.
>
> Doing so we don't need any disk format change and it would be backward compatible.
>
>
> Moreover, if we could put the smaller BG in the faster disks, we could have a
> decent tiering....
>
>
>> If a user is concerned about the write or space amplicfication of sub-stripe
>> writes on RAID56 he/she really needs to rethink the architecture.
>>
>>
>>
>> [1]
>> S. K. Mishra and P. Mohapatra,
>> "Performance study of RAID-5 disk arrays with data and parity cache,"
>> Proceedings of the 1996 ICPP Workshop on Challenges for Parallel Processing,
>> 1996, pp. 222-229 vol.1, doi: 10.1109/ICPP.1996.537164.
>
On 15.07.22 22:15, Chris Murphy wrote:
> On Fri, Jul 15, 2022 at 1:55 PM Goffredo Baroncelli <kreijack@libero.it> wrote:
>>
>> On 14/07/2022 09.46, Johannes Thumshirn wrote:
>>> On 14.07.22 09:32, Qu Wenruo wrote:
>>>> [...]
>>>
>>> Again if you're doing sub-stripe size writes, you're asking stupid things and
>>> then there's no reason to not give the user stupid answers.
>>>
>>
>> Qu is right, if we consider only full stripe write the "raid hole" problem
>> disappear, because if a "full stripe" is not fully written it is not
>> referenced either.
>>
>>
>> Personally I think that the ZFS variable stripe size, may be interesting
>> to evaluate. Moreover, because the BTRFS disk format is quite flexible,
>> we can store different BG with different number of disks. Let me to make an
>> example: if we have 10 disks, we could allocate:
>> 1 BG RAID1
>> 1 BG RAID5, spread over 4 disks only
>> 1 BG RAID5, spread over 8 disks only
>> 1 BG RAID5, spread over 10 disks
>>
>> So if we have short writes, we could put the extents in the RAID1 BG; for longer
>> writes we could use a RAID5 BG with 4 or 8 or 10 disks depending by length
>> of the data.
>>
>> Yes this would require a sort of garbage collector to move the data to the biggest
>> raid5 BG, but this would avoid (or reduce) the fragmentation which affect the
>> variable stripe size.
>>
>> Doing so we don't need any disk format change and it would be backward compatible.
>
> My 2 cents...
>
> Regarding the current raid56 support, in order of preference:
>
> a. Fix the current bugs, without changing format. Zygo has an extensive list.
> b. Mostly fix the write hole, also without changing the format, by
> only doing COW with full stripe writes. Yes you could somehow get
> corrupt parity still and not know it until degraded operation produces
> a bad reconstruction of data - but checksum will still catch that.
> This kind of "unreplicated corruption" is not quite the same thing as
> the write hole, because it isn't pernicious like the write hole.
> c. A new de-clustered parity raid56 implementation that is not
> backwards compatible.

c) is what I'm leaning to/working on, simply for the fact that it is
the only solution (that I can think of, at least) to make raid56 work
on zoned drives. And given that zoned drives tend to have a higher
capacity than regular drives, they are appealing for raid arrays.

> Ergo, I think it's best to not break the format twice. Even if a new
> raid implementation is years off.

Agreed.

> Metadata centric workloads suck on parity raid anyway. If Btrfs always
> does full stripe COW won't matter even if the performance is worse
> because no one should use parity raid for this workload anyway.
>

Yup.
On 2022/7/18 15:33, Johannes Thumshirn wrote: > On 15.07.22 22:15, Chris Murphy wrote: >> On Fri, Jul 15, 2022 at 1:55 PM Goffredo Baroncelli <kreijack@libero.it> wrote: >>> >>> On 14/07/2022 09.46, Johannes Thumshirn wrote: >>>> On 14.07.22 09:32, Qu Wenruo wrote: >>>>> [...] >>>> >>>> Again if you're doing sub-stripe size writes, you're asking stupid things and >>>> then there's no reason to not give the user stupid answers. >>>> >>> >>> Qu is right, if we consider only full stripe write the "raid hole" problem >>> disappear, because if a "full stripe" is not fully written it is not >>> referenced either. >>> >>> >>> Personally I think that the ZFS variable stripe size, may be interesting >>> to evaluate. Moreover, because the BTRFS disk format is quite flexible, >>> we can store different BG with different number of disks. Let me to make an >>> example: if we have 10 disks, we could allocate: >>> 1 BG RAID1 >>> 1 BG RAID5, spread over 4 disks only >>> 1 BG RAID5, spread over 8 disks only >>> 1 BG RAID5, spread over 10 disks >>> >>> So if we have short writes, we could put the extents in the RAID1 BG; for longer >>> writes we could use a RAID5 BG with 4 or 8 or 10 disks depending by length >>> of the data. >>> >>> Yes this would require a sort of garbage collector to move the data to the biggest >>> raid5 BG, but this would avoid (or reduce) the fragmentation which affect the >>> variable stripe size. >>> >>> Doing so we don't need any disk format change and it would be backward compatible. >> >> My 2 cents... >> >> Regarding the current raid56 support, in order of preference: >> >> a. Fix the current bugs, without changing format. Zygo has an extensive list. >> b. Mostly fix the write hole, also without changing the format, by >> only doing COW with full stripe writes. Yes you could somehow get >> corrupt parity still and not know it until degraded operation produces >> a bad reconstruction of data - but checksum will still catch that. >> This kind of "unreplicated corruption" is not quite the same thing as >> the write hole, because it isn't pernicious like the write hole. >> c. A new de-clustered parity raid56 implementation that is not >> backwards compatible. > > c) is what I'm leaning to/working on, simply for the fact, that it is > the the only solution (I can think of at least) to make raid56 working > on zoned drives. And given that zoned drives tend to have a higher > capacity than regular drives, they are appealing for raid arrays. That's what I can totally agree on. RST is not an optional, but an essential thing to support RAID profiles for data. Thus I'm not against RST on zoned device at all, no matter if it's RAID56 or not. Thanks, Qu > >> Ergo, I think it's best to not break the format twice. Even if a new >> raid implementation is years off. > > Agreed. > >> Metadata centric workloads suck on parity raid anyway. If Btrfs always >> does full stripe COW won't matter even if the performance is worse >> because no one should use parity raid for this workload anyway. >> > > Yup.
---- From: Chris Murphy <lists@colorremedies.com> -- Sent: 2022-07-15 - 22:14 ---- > On Fri, Jul 15, 2022 at 1:55 PM Goffredo Baroncelli <kreijack@libero.it> wrote: >> >> On 14/07/2022 09.46, Johannes Thumshirn wrote: >> > On 14.07.22 09:32, Qu Wenruo wrote: >> >>[...] >> > >> > Again if you're doing sub-stripe size writes, you're asking stupid things and >> > then there's no reason to not give the user stupid answers. >> > >> >> Qu is right, if we consider only full stripe write the "raid hole" problem >> disappear, because if a "full stripe" is not fully written it is not >> referenced either. >> >> >> Personally I think that the ZFS variable stripe size, may be interesting >> to evaluate. Moreover, because the BTRFS disk format is quite flexible, >> we can store different BG with different number of disks. We can create new types of BGs too. For example parity BGs. >>Let me to make an >> example: if we have 10 disks, we could allocate: >> 1 BG RAID1 >> 1 BG RAID5, spread over 4 disks only >> 1 BG RAID5, spread over 8 disks only >> 1 BG RAID5, spread over 10 disks >> >> So if we have short writes, we could put the extents in the RAID1 BG; for longer >> writes we could use a RAID5 BG with 4 or 8 or 10 disks depending by length >> of the data. >> >> Yes this would require a sort of garbage collector to move the data to the biggest >> raid5 BG, but this would avoid (or reduce) the fragmentation which affect the >> variable stripe size. >> >> Doing so we don't need any disk format change and it would be backward compatible. Do we need to implement RAID56 in the traditional sense? As the user/sysadmin I care about redundancy and performance and cost. The option to create redundancy for any 'n drives is appealing from a cost perspective, otherwise I'd use RAID1/10. Since the current RAID56 mode have several important drawbacks - and that it's officially not recommended for production use - it is a good idea to reconstruct new btrfs 'redundant-n' profiles that doesn't have the inherent issues of traditional RAID. For example a non-striped redundant-n profile as well as a striped redundant-n profile. > > My 2 cents... > > Regarding the current raid56 support, in order of preference: > > a. Fix the current bugs, without changing format. Zygo has an extensive list. I agree that relatively simple fixes should be made. But it seems we will need quite a large rewrite to solve all issues? Is there a minium viable option here? > b. Mostly fix the write hole, also without changing the format, by > only doing COW with full stripe writes. Yes you could somehow get > corrupt parity still and not know it until degraded operation produces > a bad reconstruction of data - but checksum will still catch that. > This kind of "unreplicated corruption" is not quite the same thing as > the write hole, because it isn't pernicious like the write hole. What is the difference to a)? Is write hole the worst issue? Judging from the #brtfs channel discussions there seems to be other quite severe issues, for example real data corruption risks in degraded mode. > c. A new de-clustered parity raid56 implementation that is not > backwards compatible. Yes. We have a good opportunity to work out something much better than current implementations. We could have redundant-n profiles that also works with tired storage like ssd/nvme similar to the metadata on ssd idea. Variable stripe width has been brought up before, but received cool responses. Why is that? 
IMO it could improve random 4k ios by doing the equivalent of RAID1 instead of RMW, while also closing the write hole. Perhaps there is a middle ground to be found?

>
> Ergo, I think it's best to not break the format twice. Even if a new
> raid implementation is years off.

I fully agree here. Btrfs already suffers in public opinion from the lack of a stable and safe-for-data RAID56, and requiring several non-compatible changes isn't going to help.

I also think it's important that the 'temporary' changes actually lead to a stable filesystem. Because what is the point otherwise?

Thanks
Forza

>
> Metadata centric workloads suck on parity raid anyway. If Btrfs always
> does full stripe COW won't matter even if the performance is worse
> because no one should use parity raid for this workload anyway.
>
>
> --
> Chris Murphy
On 2022/7/19 05:49, Forza wrote: > > > ---- From: Chris Murphy <lists@colorremedies.com> -- Sent: 2022-07-15 - 22:14 ---- > >> On Fri, Jul 15, 2022 at 1:55 PM Goffredo Baroncelli <kreijack@libero.it> wrote: >>> >>> On 14/07/2022 09.46, Johannes Thumshirn wrote: >>>> On 14.07.22 09:32, Qu Wenruo wrote: >>>>> [...] >>>> >>>> Again if you're doing sub-stripe size writes, you're asking stupid things and >>>> then there's no reason to not give the user stupid answers. >>>> >>> >>> Qu is right, if we consider only full stripe write the "raid hole" problem >>> disappear, because if a "full stripe" is not fully written it is not >>> referenced either. >>> >>> >>> Personally I think that the ZFS variable stripe size, may be interesting >>> to evaluate. Moreover, because the BTRFS disk format is quite flexible, >>> we can store different BG with different number of disks. > > We can create new types of BGs too. For example parity BGs. > >>> Let me to make an >>> example: if we have 10 disks, we could allocate: >>> 1 BG RAID1 >>> 1 BG RAID5, spread over 4 disks only >>> 1 BG RAID5, spread over 8 disks only >>> 1 BG RAID5, spread over 10 disks >>> >>> So if we have short writes, we could put the extents in the RAID1 BG; for longer >>> writes we could use a RAID5 BG with 4 or 8 or 10 disks depending by length >>> of the data. >>> >>> Yes this would require a sort of garbage collector to move the data to the biggest >>> raid5 BG, but this would avoid (or reduce) the fragmentation which affect the >>> variable stripe size. >>> >>> Doing so we don't need any disk format change and it would be backward compatible. > > Do we need to implement RAID56 in the traditional sense? As the user/sysadmin I care about redundancy and performance and cost. The option to create redundancy for any 'n drives is appealing from a cost perspective, otherwise I'd use RAID1/10. Have you heard any recent problems related to dm-raid56? If your answer is no, then I guess we already have an answer to your question. > > Since the current RAID56 mode have several important drawbacks Let me to be clear: If you can ensure you didn't hit power loss, or after a power loss do a scrub immediately before any new write, then current RAID56 is fine, at least not obviously worse than dm-raid56. (There are still common problems shared between both btrfs raid56 and dm-raid56, like destructive-RMW) > - and that it's officially not recommended for production use - it is a good idea to reconstruct new btrfs 'redundant-n' profiles that doesn't have the inherent issues of traditional RAID. I'd say the complexity is hugely underestimated. > For example a non-striped redundant-n profile as well as a striped redundant-n profile. Non-striped redundant-n profile is already so complex that I can't figure out a working idea right now. But if there is such way, I'm pretty happy to consider. > >> >> My 2 cents... >> >> Regarding the current raid56 support, in order of preference: >> >> a. Fix the current bugs, without changing format. Zygo has an extensive list. > > I agree that relatively simple fixes should be made. But it seems we will need quite a large rewrite to solve all issues? Is there a minium viable option here? Nope. Just see my write-intent code, already have prototype (just needs new scrub based recovery code at mount time) working. And based on my write-intent code, I don't think it's that hard to implement a full journal. Thanks, Qu > >> b. 
Mostly fix the write hole, also without changing the format, by >> only doing COW with full stripe writes. Yes you could somehow get >> corrupt parity still and not know it until degraded operation produces >> a bad reconstruction of data - but checksum will still catch that. >> This kind of "unreplicated corruption" is not quite the same thing as >> the write hole, because it isn't pernicious like the write hole. > > What is the difference to a)? Is write hole the worst issue? Judging from the #brtfs channel discussions there seems to be other quite severe issues, for example real data corruption risks in degraded mode. > >> c. A new de-clustered parity raid56 implementation that is not >> backwards compatible. > > Yes. We have a good opportunity to work out something much better than current implementations. We could have redundant-n profiles that also works with tired storage like ssd/nvme similar to the metadata on ssd idea. > > Variable stripe width has been brought up before, but received cool responses. Why is that? IMO it could improve random 4k ios by doing equivalent to RAID1 instead of RMW, while also closing the write hole. Perhaps there is a middle ground to be found? > > >> >> Ergo, I think it's best to not break the format twice. Even if a new >> raid implementation is years off. > > I very agree here. Btrfs already suffers in public opinion from the lack of a stable and safe-for-data RAID56, and requiring several non-compatible chances isn't going to help. > > I also think it's important that the 'temporary' changes actually leads to a stable filesystem. Because what is the point otherwise? > > Thanks > Forza > >> >> Metadata centric workloads suck on parity raid anyway. If Btrfs always >> does full stripe COW won't matter even if the performance is worse >> because no one should use parity raid for this workload anyway. >> >> >> -- >> Chris Murphy > >
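[Editor's note: Qu's write-intent code itself is not posted in this thread, so the following is only a generic sketch of the md-style idea being referenced: one bit covers one full stripe, the bit is set and persisted before a sub-stripe (RMW) write is submitted, and cleared after the write completes. All names and the geometry are made up.]

#include <stdint.h>
#include <string.h>
#include <stdio.h>

#define FULL_STRIPE_LEN (6ULL * 64 * 1024)   /* assumed geometry */
#define NR_BITS         4096
static uint8_t bitmap[NR_BITS / 8];

static void set_bit(uint64_t bit)   { bitmap[bit / 8] |=  (uint8_t)(1 << (bit % 8)); }
static void clear_bit(uint64_t bit) { bitmap[bit / 8] &= (uint8_t)~(1 << (bit % 8)); }

/* Would flush the bitmap to a reserved area on every device. */
static void persist_bitmap(void)    { printf("persist bitmap\n"); }

static void submit_partial_write(uint64_t logical, uint64_t len)
{
	uint64_t first = logical / FULL_STRIPE_LEN;
	uint64_t last  = (logical + len - 1) / FULL_STRIPE_LEN;

	for (uint64_t b = first; b <= last; b++)
		set_bit(b);
	persist_bitmap();                 /* must hit disk before the data      */

	printf("RMW write %llu+%llu\n", (unsigned long long)logical,
	       (unsigned long long)len);  /* the actual sub-stripe write        */

	for (uint64_t b = first; b <= last; b++)
		clear_bit(b);             /* can be persisted lazily afterwards */
}

int main(void)
{
	memset(bitmap, 0, sizeof(bitmap));
	submit_partial_write(400 * 1024, 8 * 1024);
	return 0;
}

[After a crash, any stripe whose bit is still set is a candidate for the scrub-based resync mentioned above; everything else can be trusted.]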
On 18/07/2022 09.30, Johannes Thumshirn wrote:
> On 15.07.22 19:54, Goffredo Baroncelli wrote:
>> On 14/07/2022 09.46, Johannes Thumshirn wrote:
>>> On 14.07.22 09:32, Qu Wenruo wrote:
>>>> [...]
>>>
>>> Again if you're doing sub-stripe size writes, you're asking stupid things and
>>> then there's no reason to not give the user stupid answers.
>>>
>>
>> Qu is right, if we consider only full stripe write the "raid hole" problem
>> disappear, because if a "full stripe" is not fully written it is not
>> referenced either.
>
> It's not that there will be a new write hole, it's just that there is sub-optimal
> space consumption until we can either re-write or garbage collect the blocks.

Maybe I was not very clear. Let me repeat: if we assume that we can write only
full stripes (padding with 0 if smaller), we don't have the write-hole problem at
all, so we can also avoid using RST.

>>
>> Personally I think that the ZFS variable stripe size, may be interesting
>
> But then we would need extra meta-data to describe the size of each stripe.

It is not needed. The stripe allocation is per extent. The layout of the stripe
depends only on the start of the extent and its length.

Assuming that you have n disks and a raid5 layout, you already know that for each
(n-1) data blocks there is a parity block in the extent. The relation between the
length of the extent and the real data stored is:

   extent-length  = data-length + 4k * (data-length / 4k / (n-1))
   extent-length += 4k if (data-length % (4k * (n-1))) > 0

where:
   extent-length = size of the extent (which contains the parity blocks)
   data-length   = the real length of consecutive data
   n             = number of disks

Below are some examples that show my idea better. Assuming the following logical
address layout:

   Disk1    0   12k   24k
   Disk2   4k   16k   28k
   Disk3   8k   20k   ....

First write: data size = 1 block

   Disk1   D1   ...
   Disk2   P1   ...
   Disk3   ...

   Extent = (0, 8K)

Second write: data size = 3 blocks (D2, D2*, D3)

   Disk1   D1   D2*   P3   ...
   Disk2   P1   P2    ...
   Disk3   D2   D3    ...

   Extent = (8k, 20K)

Writing bigger data, we can shape the stripe taller. Third write: data size = 32k
(D6, D6*, ... D9*):

   Disk1   D1   D2*   P3   P68   P6*8*   P79   P7*9*   ...
   Disk2   P1   P2    D6   D6*   D7      D7*   ...
   Disk3   D2   D3    D8   D8*   D9      D9*   ...

   Extent = (28k, 48k)

The major drawbacks are:
- you can break the extent only at a stripe boundary (up to 64K * n)
- scrub is a bit more complex, because it involves some math around the extents'
  start/length
- you need to have an extent that describes the stripe. I don't know if this
  requirement is fulfilled by metadata.

The scheme above becomes much simpler if we allow BTRFS to use BGs with a
dedicated stripe size. Moreover this would reduce the fragmentation, even though
it requires a GC.

>
>> to evaluate. Moreover, because the BTRFS disk format is quite flexible,
>> we can store different BG with different number of disks. Let me to make an
>> example: if we have 10 disks, we could allocate:
>> 1 BG RAID1
>> 1 BG RAID5, spread over 4 disks only
>> 1 BG RAID5, spread over 8 disks only
>> 1 BG RAID5, spread over 10 disks
>>
>> So if we have short writes, we could put the extents in the RAID1 BG; for longer
>> writes we could use a RAID5 BG with 4 or 8 or 10 disks depending by length
>> of the data.
>>
>> Yes this would require a sort of garbage collector to move the data to the biggest
>> raid5 BG, but this would avoid (or reduce) the fragmentation which affect the
>> variable stripe size.
>>
>> Doing so we don't need any disk format change and it would be backward compatible.
>> >> >> Moreover, if we could put the smaller BG in the faster disks, we could have a >> decent tiering.... >> >> >>> If a user is concerned about the write or space amplicfication of sub-stripe >>> writes on RAID56 he/she really needs to rethink the architecture. >>> >>> >>> >>> [1] >>> S. K. Mishra and P. Mohapatra, >>> "Performance study of RAID-5 disk arrays with data and parity cache," >>> Proceedings of the 1996 ICPP Workshop on Challenges for Parallel Processing, >>> 1996, pp. 222-229 vol.1, doi: 10.1109/ICPP.1996.537164. >> >
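[Editor's note: Goffredo's extent-length relation above can be checked with a small program. The 4K block size and the n = 3 example follow the mail; this only restates the arithmetic, it is not an implementation proposal.]

#include <stdio.h>
#include <stdint.h>

#define BLK 4096ULL

/* Size of a variable-stripe extent holding data_length bytes on n disks
 * (n-1 data blocks plus 1 parity block per stripe row). */
static uint64_t extent_length(uint64_t data_length, unsigned int n)
{
	uint64_t data_blocks   = data_length / BLK;
	uint64_t parity_blocks = data_blocks / (n - 1);

	if (data_blocks % (n - 1))
		parity_blocks++;   /* a partial last row still needs parity */
	return data_length + parity_blocks * BLK;
}

int main(void)
{
	/* The three writes from the example above, with n = 3 disks. */
	printf("%llu\n", (unsigned long long)extent_length(4 * 1024, 3));
	printf("%llu\n", (unsigned long long)extent_length(12 * 1024, 3));
	printf("%llu\n", (unsigned long long)extent_length(32 * 1024, 3));
	return 0;
}

[For 4K, 12K and 32K of data on 3 disks this prints 8192, 20480 and 49152, matching the (0, 8K), (8k, 20K) and (28k, 48k) extents in the examples above.]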
On 2022-07-19 03:19, Qu Wenruo wrote: > > > On 2022/7/19 05:49, Forza wrote: >> >> >> ---- From: Chris Murphy <lists@colorremedies.com> -- Sent: 2022-07-15 >> - 22:14 ---- >> >>> On Fri, Jul 15, 2022 at 1:55 PM Goffredo Baroncelli >>> <kreijack@libero.it> wrote: >>>> >>>> On 14/07/2022 09.46, Johannes Thumshirn wrote: >>>>> On 14.07.22 09:32, Qu Wenruo wrote: >>>>>> [...] >>>>> >>>>> Again if you're doing sub-stripe size writes, you're asking stupid >>>>> things and >>>>> then there's no reason to not give the user stupid answers. >>>>> >>>> >>>> Qu is right, if we consider only full stripe write the "raid hole" >>>> problem >>>> disappear, because if a "full stripe" is not fully written it is not >>>> referenced either. >>>> >>>> >>>> Personally I think that the ZFS variable stripe size, may be >>>> interesting >>>> to evaluate. Moreover, because the BTRFS disk format is quite flexible, >>>> we can store different BG with different number of disks. >> >> We can create new types of BGs too. For example parity BGs. >> >>>> Let me to make an >>>> example: if we have 10 disks, we could allocate: >>>> 1 BG RAID1 >>>> 1 BG RAID5, spread over 4 disks only >>>> 1 BG RAID5, spread over 8 disks only >>>> 1 BG RAID5, spread over 10 disks >>>> >>>> So if we have short writes, we could put the extents in the RAID1 >>>> BG; for longer >>>> writes we could use a RAID5 BG with 4 or 8 or 10 disks depending by >>>> length >>>> of the data. >>>> >>>> Yes this would require a sort of garbage collector to move the data >>>> to the biggest >>>> raid5 BG, but this would avoid (or reduce) the fragmentation which >>>> affect the >>>> variable stripe size. >>>> >>>> Doing so we don't need any disk format change and it would be >>>> backward compatible. >> >> Do we need to implement RAID56 in the traditional sense? As the >> user/sysadmin I care about redundancy and performance and cost. The >> option to create redundancy for any 'n drives is appealing from a cost >> perspective, otherwise I'd use RAID1/10. > > Have you heard any recent problems related to dm-raid56? No..? > > If your answer is no, then I guess we already have an answer to your > question. > >> >> Since the current RAID56 mode have several important drawbacks > > Let me to be clear: > > If you can ensure you didn't hit power loss, or after a power loss do a > scrub immediately before any new write, then current RAID56 is fine, at > least not obviously worse than dm-raid56. > > (There are still common problems shared between both btrfs raid56 and > dm-raid56, like destructive-RMW) > >> - and that it's officially not recommended for production use - it is >> a good idea to reconstruct new btrfs 'redundant-n' profiles that >> doesn't have the inherent issues of traditional RAID. > > I'd say the complexity is hugely underestimated. You are probably right. But is it solvable, and is there a vision of 'something better' than traditional RAID56? > >> For example a non-striped redundant-n profile as well as a striped >> redundant-n profile. > > Non-striped redundant-n profile is already so complex that I can't > figure out a working idea right now. > > But if there is such way, I'm pretty happy to consider. Can we borrow ideas from the PAR2/PAR3 format? For each extent, create 'par' redundancy metadata that allows for n-% or n-copies of recovery, and that this metadata is also split on different disks to allow for n total drive-failures? Maybe parity data can be stored in parity BGs, in metadata itself or in special type of extents inside data BGs. 
> >> >>> >>> My 2 cents... >>> >>> Regarding the current raid56 support, in order of preference: >>> >>> a. Fix the current bugs, without changing format. Zygo has an >>> extensive list. >> >> I agree that relatively simple fixes should be made. But it seems we >> will need quite a large rewrite to solve all issues? Is there a minium >> viable option here? > > Nope. Just see my write-intent code, already have prototype (just needs > new scrub based recovery code at mount time) working. > > And based on my write-intent code, I don't think it's that hard to > implement a full journal. > This is good news. Do you see any other major issues that would need fixing before RADI56 can be considered production-ready? > Thanks, > Qu > >> >>> b. Mostly fix the write hole, also without changing the format, by >>> only doing COW with full stripe writes. Yes you could somehow get >>> corrupt parity still and not know it until degraded operation produces >>> a bad reconstruction of data - but checksum will still catch that. >>> This kind of "unreplicated corruption" is not quite the same thing as >>> the write hole, because it isn't pernicious like the write hole. >> >> What is the difference to a)? Is write hole the worst issue? Judging >> from the #brtfs channel discussions there seems to be other quite >> severe issues, for example real data corruption risks in degraded mode. >> >>> c. A new de-clustered parity raid56 implementation that is not >>> backwards compatible. >> >> Yes. We have a good opportunity to work out something much better than >> current implementations. We could have redundant-n profiles that also >> works with tired storage like ssd/nvme similar to the metadata on ssd >> idea. >> >> Variable stripe width has been brought up before, but received cool >> responses. Why is that? IMO it could improve random 4k ios by doing >> equivalent to RAID1 instead of RMW, while also closing the write hole. >> Perhaps there is a middle ground to be found? >> >> >>> >>> Ergo, I think it's best to not break the format twice. Even if a new >>> raid implementation is years off. >> >> I very agree here. Btrfs already suffers in public opinion from the >> lack of a stable and safe-for-data RAID56, and requiring several >> non-compatible chances isn't going to help. >> >> I also think it's important that the 'temporary' changes actually >> leads to a stable filesystem. Because what is the point otherwise? >> >> Thanks >> Forza >> >>> >>> Metadata centric workloads suck on parity raid anyway. If Btrfs always >>> does full stripe COW won't matter even if the performance is worse >>> because no one should use parity raid for this workload anyway. >>> >>> >>> -- >>> Chris Murphy >> >>
On 2022/7/21 22:51, Forza wrote: > > > On 2022-07-19 03:19, Qu Wenruo wrote: >> >> >> On 2022/7/19 05:49, Forza wrote: >>> >>> >>> ---- From: Chris Murphy <lists@colorremedies.com> -- Sent: 2022-07-15 >>> - 22:14 ---- >>> >>>> On Fri, Jul 15, 2022 at 1:55 PM Goffredo Baroncelli >>>> <kreijack@libero.it> wrote: >>>>> >>>>> On 14/07/2022 09.46, Johannes Thumshirn wrote: >>>>>> On 14.07.22 09:32, Qu Wenruo wrote: >>>>>>> [...] >>>>>> >>>>>> Again if you're doing sub-stripe size writes, you're asking stupid >>>>>> things and >>>>>> then there's no reason to not give the user stupid answers. >>>>>> >>>>> >>>>> Qu is right, if we consider only full stripe write the "raid hole" >>>>> problem >>>>> disappear, because if a "full stripe" is not fully written it is not >>>>> referenced either. >>>>> >>>>> >>>>> Personally I think that the ZFS variable stripe size, may be >>>>> interesting >>>>> to evaluate. Moreover, because the BTRFS disk format is quite >>>>> flexible, >>>>> we can store different BG with different number of disks. >>> >>> We can create new types of BGs too. For example parity BGs. >>> >>>>> Let me to make an >>>>> example: if we have 10 disks, we could allocate: >>>>> 1 BG RAID1 >>>>> 1 BG RAID5, spread over 4 disks only >>>>> 1 BG RAID5, spread over 8 disks only >>>>> 1 BG RAID5, spread over 10 disks >>>>> >>>>> So if we have short writes, we could put the extents in the RAID1 >>>>> BG; for longer >>>>> writes we could use a RAID5 BG with 4 or 8 or 10 disks depending by >>>>> length >>>>> of the data. >>>>> >>>>> Yes this would require a sort of garbage collector to move the data >>>>> to the biggest >>>>> raid5 BG, but this would avoid (or reduce) the fragmentation which >>>>> affect the >>>>> variable stripe size. >>>>> >>>>> Doing so we don't need any disk format change and it would be >>>>> backward compatible. >>> >>> Do we need to implement RAID56 in the traditional sense? As the >>> user/sysadmin I care about redundancy and performance and cost. The >>> option to create redundancy for any 'n drives is appealing from a >>> cost perspective, otherwise I'd use RAID1/10. >> >> Have you heard any recent problems related to dm-raid56? > > No..? Then, I'd say their write-intent + journal (PPL for RAID5, full journal for RAID6) is a tried and true solution. I see no reason not to follow. > >> >> If your answer is no, then I guess we already have an answer to your >> question. >> >>> >>> Since the current RAID56 mode have several important drawbacks >> >> Let me to be clear: >> >> If you can ensure you didn't hit power loss, or after a power loss do a >> scrub immediately before any new write, then current RAID56 is fine, at >> least not obviously worse than dm-raid56. >> >> (There are still common problems shared between both btrfs raid56 and >> dm-raid56, like destructive-RMW) >> >>> - and that it's officially not recommended for production use - it is >>> a good idea to reconstruct new btrfs 'redundant-n' profiles that >>> doesn't have the inherent issues of traditional RAID. >> >> I'd say the complexity is hugely underestimated. > > You are probably right. But is it solvable, and is there a vision of > 'something better' than traditional RAID56? I'd say, maybe. I prefer some encode at file extent level (like compression) to provide extra data recovery, other than relying on stripe based RAID56. The problem is, normally such encoding is to correct data corruption for a small percentage, but for regular RAID1/10 or even small number of disks RAID56, the percentage is not small. 
(missing 1 disk in 3 disks RAID5, we're in fact recovery 50% of our data) If we can find a good encode (probably used after compression), I'm 100% fine to use that encoding, other than traditional RAID56. > >> >>> For example a non-striped redundant-n profile as well as a striped >>> redundant-n profile. >> >> Non-striped redundant-n profile is already so complex that I can't >> figure out a working idea right now. >> >> But if there is such way, I'm pretty happy to consider. > > Can we borrow ideas from the PAR2/PAR3 format? > > For each extent, create 'par' redundancy metadata that allows for n-% or > n-copies of recovery, and that this metadata is also split on different > disks to allow for n total drive-failures? Maybe parity data can be > stored in parity BGs, in metadata itself or in special type of extents > inside data BGs. The problem is still there, if there is anything representing a stripe, and any calculate extra info based on stripes, then we can still hit the write-hole problem. If we do sub-stripe write, we have to update the checksum or whatever, which can be out-of-sync during power loss. If you mean an extra tree to store all these extra checksum/info (aka, no longer need the stripe unit at all), then I guess it may be possible. Like we use a special csum algorithm which may take way larger space than our current 32 bytes per 4K, then I guess we may be able to get extra redundancy. There will be some problems like different metadata/data csum (metadata csum is limited to 32bytes as it's inlined), and way larger metadata usage for csum. But those should be more or less solvable. > >> >>> >>>> >>>> My 2 cents... >>>> >>>> Regarding the current raid56 support, in order of preference: >>>> >>>> a. Fix the current bugs, without changing format. Zygo has an >>>> extensive list. >>> >>> I agree that relatively simple fixes should be made. But it seems we >>> will need quite a large rewrite to solve all issues? Is there a >>> minium viable option here? >> >> Nope. Just see my write-intent code, already have prototype (just needs >> new scrub based recovery code at mount time) working. >> >> And based on my write-intent code, I don't think it's that hard to >> implement a full journal. >> > > This is good news. Do you see any other major issues that would need > fixing before RADI56 can be considered production-ready? Currently I have only finished write-intent bitmaps, which requires after power loss, all devices are still available and data not touched is still correct. For powerloss + missing device, I have to go full journal, but the code should be pretty similar thus I'm not that concerned. The biggest problem remaining is, write-intent bitmap/full journal requires regular devices, no support for zoned devices at all. Thus zoned guys are not a big fan of this solution. Thanks, Qu > > >> Thanks, >> Qu >> >>> >>>> b. Mostly fix the write hole, also without changing the format, by >>>> only doing COW with full stripe writes. Yes you could somehow get >>>> corrupt parity still and not know it until degraded operation produces >>>> a bad reconstruction of data - but checksum will still catch that. >>>> This kind of "unreplicated corruption" is not quite the same thing as >>>> the write hole, because it isn't pernicious like the write hole. >>> >>> What is the difference to a)? Is write hole the worst issue? Judging >>> from the #brtfs channel discussions there seems to be other quite >>> severe issues, for example real data corruption risks in degraded mode. >>> >>>> c. 
A new de-clustered parity raid56 implementation that is not >>>> backwards compatible. >>> >>> Yes. We have a good opportunity to work out something much better >>> than current implementations. We could have redundant-n profiles >>> that also works with tired storage like ssd/nvme similar to the >>> metadata on ssd idea. >>> >>> Variable stripe width has been brought up before, but received cool >>> responses. Why is that? IMO it could improve random 4k ios by doing >>> equivalent to RAID1 instead of RMW, while also closing the write >>> hole. Perhaps there is a middle ground to be found? >>> >>> >>>> >>>> Ergo, I think it's best to not break the format twice. Even if a new >>>> raid implementation is years off. >>> >>> I very agree here. Btrfs already suffers in public opinion from the >>> lack of a stable and safe-for-data RAID56, and requiring several >>> non-compatible chances isn't going to help. >>> >>> I also think it's important that the 'temporary' changes actually >>> leads to a stable filesystem. Because what is the point otherwise? >>> >>> Thanks >>> Forza >>> >>>> >>>> Metadata centric workloads suck on parity raid anyway. If Btrfs always >>>> does full stripe COW won't matter even if the performance is worse >>>> because no one should use parity raid for this workload anyway. >>>> >>>> >>>> -- >>>> Chris Murphy >>> >>>
On Tue, Jul 19, 2022 at 09:19:21AM +0800, Qu Wenruo wrote: > > > > Doing so we don't need any disk format change and it would be backward compatible. > > > > Do we need to implement RAID56 in the traditional sense? As the > user/sysadmin I care about redundancy and performance and cost. The > option to create redundancy for any 'n drives is appealing from a cost > perspective, otherwise I'd use RAID1/10. > > Have you heard any recent problems related to dm-raid56? > > If your answer is no, then I guess we already have an answer to your > question. With plain dm-raid56 the problems were there since the beginning, so they're not recent. If there's a way to configure PPL or a journal device with raid5 LVs on LVM, I can't find it. AFAIK nobody who knows what they're doing would choose dm-raid56 for high-value data, especially when alternatives like ZFS exist. Before btrfs, we had a single-digit-percentage rate of severe data losses (more than 90% data lost) on filesystems and databases using mdadm + ext3/4 with no journal in degraded mode. Multiply by per-drive AFR and that's a lot of full system rebuilds over the years. > > Since the current RAID56 mode have several important drawbacks > > Let me to be clear: > > If you can ensure you didn't hit power loss, or after a power loss do a > scrub immediately before any new write, then current RAID56 is fine, at > least not obviously worse than dm-raid56. I'm told that scrub doesn't repair parity errors on btrfs. That was a thing I got wrong in my raid5 bug list from 2020. Scrub will fix data blocks if they have csum errors, but it will not detect or correct corruption in the parity blocks themselves. AFAICT the only way to get the parity blocks rewritten is to run something like balance, which carries risks of its own due to the sheer volume of IO from data and metadata updates. Most of the raid56 bugs I've identified have nothing to do with power loss. The data on disks is fine, but the kernel can't read it correctly in degraded mode, or the diagnostic data from scrub are clearly garbage. I noticed you and others have done some work here recently, so some of these issues might be fixed in 5.19. I haven't re-run my raid5 tests on post-5.18 kernels yet (there have been other bugs blocking testing). > (There are still common problems shared between both btrfs raid56 and > dm-raid56, like destructive-RMW) Yeah, that's one of the critical things to fix because btrfs is in a good position to do as well or better than dm-raid56. btrfs has definitely fallen behind the other available solutions in the 9 years since raid5 was first added to btrfs, as btrfs implements only the basic configuration of raid56 (no parity integrity or rmw journal) that is fully vulnerable to write hole and drive-side data corruption. > > - and that it's officially not recommended for production use - it > is a good idea to reconstruct new btrfs 'redundant-n' profiles that > doesn't have the inherent issues of traditional RAID. > > I'd say the complexity is hugely underestimated. I'd agree with that. e.g. some btrfs equivalent of ZFS raidZ (put parity blocks inline with extents during writes) is not much more complex to implement on btrfs than compression; however, the btrfs kernel code couldn't read compressed data correctly for 12 years out of its 14-year history, and nobody wants to wait another decade or more for raid5 to work. 
It seems to me the biggest problem with write hole fixes is that all the potential fixes have cost tradeoffs, and everybody wants to veto the fix that has a cost they don't like. We could implement multiple fix approaches at the same time, as AFAIK most of the proposed solutions are orthogonal to each other. e.g. a write-ahead log can safely enable RMW at a higher IO cost, while the allocator could place extents to avoid RMW and thereby avoid the logging cost as much as possible (paid for by a deferred relocation/garbage collection cost), and using both at the same time would combine both benefits. Both solutions can be used independently for filesystems at extreme ends of the performance/capacity spectrum (if the filesystem is never more than 50% full, then logging is all cost with no gain compared to allocator avoidance of RMW, while a filesystem that is always near full will have to journal writes and also throttle writes on the journal. > > For example a non-striped redundant-n profile as well as a striped redundant-n profile. > > Non-striped redundant-n profile is already so complex that I can't > figure out a working idea right now. > > But if there is such way, I'm pretty happy to consider. > > > > > > > > > My 2 cents... > > > > > > Regarding the current raid56 support, in order of preference: > > > > > > a. Fix the current bugs, without changing format. Zygo has an extensive list. > > > > I agree that relatively simple fixes should be made. But it seems we will need quite a large rewrite to solve all issues? Is there a minium viable option here? > > Nope. Just see my write-intent code, already have prototype (just needs > new scrub based recovery code at mount time) working. > > And based on my write-intent code, I don't think it's that hard to > implement a full journal. FWIW I think we can get a very usable btrfs raid5 with a small format change (add a journal for stripe RMW, though we might disagree about details of how it should be structured and used) and fixes to the read-repair and scrub problems. The read-side problems in btrfs raid5 were always much more severe than the write hole. As soon as a disk goes offline, the read-repair code is unable to read all the surviving data correctly, and the filesystem has to be kept inactive or data on the disks will be gradually corrupted as bad parity gets mixed with data and written back to the filesystem. A few of the problems will require a deeper redesign, but IMHO they're not important problems. e.g. scrub can't identify which drive is corrupted in all cases, because it has no csum on parity blocks. The current on-disk format needs every data block in the raid5 stripe to be occupied by a file with a csum so scrub can eliminate every other block as the possible source of mismatched parity. While this could be fixed by a future new raid5 profile (and/or csum tree) specifically designed to avoid this, it's not something I'd insist on having before deploying a fleet of btrfs raid5 boxes. Silent corruption failures are so rare on spinning disks that I'd use the feature maybe once a decade. Silent corruption due to a failing or overheating HBA chip will most likely affect multiple disks at once and trash the whole filesystem, so individual drive-level corruption reporting isn't helpful. > Thanks, > Qu > > > > > > b. Mostly fix the write hole, also without changing the format, by > > > only doing COW with full stripe writes. 
Yes you could somehow get > > > corrupt parity still and not know it until degraded operation produces > > > a bad reconstruction of data - but checksum will still catch that. > > > This kind of "unreplicated corruption" is not quite the same thing as > > > the write hole, because it isn't pernicious like the write hole. > > > > What is the difference to a)? Is write hole the worst issue? Judging from the #brtfs channel discussions there seems to be other quite severe issues, for example real data corruption risks in degraded mode. > > > > > c. A new de-clustered parity raid56 implementation that is not > > > backwards compatible. > > > > Yes. We have a good opportunity to work out something much better than current implementations. We could have redundant-n profiles that also works with tired storage like ssd/nvme similar to the metadata on ssd idea. > > > > Variable stripe width has been brought up before, but received cool responses. Why is that? IMO it could improve random 4k ios by doing equivalent to RAID1 instead of RMW, while also closing the write hole. Perhaps there is a middle ground to be found? > > > > > > > > > > Ergo, I think it's best to not break the format twice. Even if a new > > > raid implementation is years off. > > > > I very agree here. Btrfs already suffers in public opinion from the lack of a stable and safe-for-data RAID56, and requiring several non-compatible chances isn't going to help. > > > > I also think it's important that the 'temporary' changes actually leads to a stable filesystem. Because what is the point otherwise? > > > > Thanks > > Forza > > > > > > > > Metadata centric workloads suck on parity raid anyway. If Btrfs always > > > does full stripe COW won't matter even if the performance is worse > > > because no one should use parity raid for this workload anyway. > > > > > > > > > -- > > > Chris Murphy > > > >
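[Editor's note: one way to read Zygo's "use both at the same time" point above is as a per-write policy: full-stripe-aligned writes bypass the log entirely, everything else pays the journalling cost. A toy sketch with invented names and a fixed geometry, purely to make the decision concrete.]

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define FULL_STRIPE_LEN (6ULL * 64 * 1024)   /* assumed geometry */

/* Aligned multiple of the full stripe => pure CoW, no RMW, no journal. */
static bool needs_stripe_journal(uint64_t logical, uint64_t len)
{
	return (logical % FULL_STRIPE_LEN) || (len % FULL_STRIPE_LEN);
}

int main(void)
{
	printf("%d\n", needs_stripe_journal(0, FULL_STRIPE_LEN));   /* 0: straight to disk  */
	printf("%d\n", needs_stripe_journal(4096, 16 * 1024));      /* 1: go through journal */
	return 0;
}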
On 2022/7/25 08:00, Zygo Blaxell wrote: > On Tue, Jul 19, 2022 at 09:19:21AM +0800, Qu Wenruo wrote: >>>>> Doing so we don't need any disk format change and it would be backward compatible. >>> >>> Do we need to implement RAID56 in the traditional sense? As the >> user/sysadmin I care about redundancy and performance and cost. The >> option to create redundancy for any 'n drives is appealing from a cost >> perspective, otherwise I'd use RAID1/10. >> >> Have you heard any recent problems related to dm-raid56? >> >> If your answer is no, then I guess we already have an answer to your >> question. > > With plain dm-raid56 the problems were there since the beginning, so > they're not recent. Are you talking about mdraid? They go internal write-intent bitmap and PPL by default. > If there's a way to configure PPL or a journal > device with raid5 LVs on LVM, I can't find it. LVM is another story. > AFAIK nobody who knows > what they're doing would choose dm-raid56 for high-value data, especially > when alternatives like ZFS exist. Isn't it the opposite? mdraid is what most people go, other than LVM raid. > > Before btrfs, we had a single-digit-percentage rate of severe data losses > (more than 90% data lost) on filesystems and databases using mdadm + > ext3/4 with no journal in degraded mode. Multiply by per-drive AFR > and that's a lot of full system rebuilds over the years. > >>> Since the current RAID56 mode have several important drawbacks >> >> Let me to be clear: >> >> If you can ensure you didn't hit power loss, or after a power loss do a >> scrub immediately before any new write, then current RAID56 is fine, at >> least not obviously worse than dm-raid56. > > I'm told that scrub doesn't repair parity errors on btrfs. That's totally untrue. You can easily verify that using "btrfs check --check-data-csum", as recent btrfs-progs has the extra code to verify the rebuilt data using parity. In fact, I'm testing my write-intent bitmaps code with manually corrupted parity to emulate a power loss after write-intent bitmaps update. And I must say, the scrub code works as expected. The myth may come from some bad advice on only scrubbing a single device for RAID56 to avoid duplicated IO. But the truth is, if only scrubbing one single device, for data stripes on that device, if no csum error detected, scrub won't check the parity or the other data stripes in the same vertical stripe. On the other hand, if scrub is checking the parity stripe, it will also check the csum for the data stripes in the same vertical stripe, and rewrite the parity if needed. > That was a > thing I got wrong in my raid5 bug list from 2020. Scrub will fix data > blocks if they have csum errors, but it will not detect or correct > corruption in the parity blocks themselves. That's exactly what I mentioned, the user is trying to be a smartass without knowing the details. Although I think we should enhance the man page to discourage the usage of single device scrub. By default, we scrub all devices (using mount point). > AFAICT the only way to > get the parity blocks rewritten is to run something like balance, > which carries risks of its own due to the sheer volume of IO from > data and metadata updates. Completely incorrect. > > Most of the raid56 bugs I've identified have nothing to do with power > loss. The data on disks is fine, but the kernel can't read it correctly > in degraded mode, or the diagnostic data from scrub are clearly garbage. Unable to read in degraded mode just means parity is out-of-sync with data. 
There are several other bugs related to this, mostly related to the cached raid bio and how we rebuild the data. (aka, btrfs/125) Thankfully I have submitted patches for that bug and now btrfs/125 should pass without problems. But the powerloss can still lead to out-of-sync parity and that's why I'm fixing the problem using write-intent-bitmaps. > > I noticed you and others have done some work here recently, so some of > these issues might be fixed in 5.19. I haven't re-run my raid5 tests > on post-5.18 kernels yet (there have been other bugs blocking testing). > >> (There are still common problems shared between both btrfs raid56 and >> dm-raid56, like destructive-RMW) > > Yeah, that's one of the critical things to fix because btrfs is in a good > position to do as well or better than dm-raid56. btrfs has definitely > fallen behind the other available solutions in the 9 years since raid5 was > first added to btrfs, as btrfs implements only the basic configuration > of raid56 (no parity integrity or rmw journal) that is fully vulnerable > to write hole and drive-side data corruption. > >>> - and that it's officially not recommended for production use - it >> is a good idea to reconstruct new btrfs 'redundant-n' profiles that >> doesn't have the inherent issues of traditional RAID. >> >> I'd say the complexity is hugely underestimated. > > I'd agree with that. e.g. some btrfs equivalent of ZFS raidZ (put parity > blocks inline with extents during writes) is not much more complex to > implement on btrfs than compression; however, the btrfs kernel code > couldn't read compressed data correctly for 12 years out of its 14-year > history, and nobody wants to wait another decade or more for raid5 > to work. > > It seems to me the biggest problem with write hole fixes is that all > the potential fixes have cost tradeoffs, and everybody wants to veto > the fix that has a cost they don't like. Well, that's why I prefer multiple solutions for end users to choose, other than really trying to get a silver bullet solution. (That's also why I'm recently trying to separate block group tree from extent tree v2, as I really believe progressive improvement over a death ball feature) Thanks, Qu > > We could implement multiple fix approaches at the same time, as AFAIK > most of the proposed solutions are orthogonal to each other. e.g. a > write-ahead log can safely enable RMW at a higher IO cost, while the > allocator could place extents to avoid RMW and thereby avoid the logging > cost as much as possible (paid for by a deferred relocation/garbage > collection cost), and using both at the same time would combine both > benefits. Both solutions can be used independently for filesystems at > extreme ends of the performance/capacity spectrum (if the filesystem is > never more than 50% full, then logging is all cost with no gain compared > to allocator avoidance of RMW, while a filesystem that is always near > full will have to journal writes and also throttle writes on the journal. > >>> For example a non-striped redundant-n profile as well as a striped redundant-n profile. >> >> Non-striped redundant-n profile is already so complex that I can't >> figure out a working idea right now. >> >> But if there is such way, I'm pretty happy to consider. >> >>> >>>> >>>> My 2 cents... >>>> >>>> Regarding the current raid56 support, in order of preference: >>>> >>>> a. Fix the current bugs, without changing format. Zygo has an extensive list. >>> >>> I agree that relatively simple fixes should be made. 
But it seems we will need quite a large rewrite to solve all issues? Is there a minium viable option here? >> >> Nope. Just see my write-intent code, already have prototype (just needs >> new scrub based recovery code at mount time) working. >> >> And based on my write-intent code, I don't think it's that hard to >> implement a full journal. > > FWIW I think we can get a very usable btrfs raid5 with a small format > change (add a journal for stripe RMW, though we might disagree about > details of how it should be structured and used) and fixes to the > read-repair and scrub problems. The read-side problems in btrfs raid5 > were always much more severe than the write hole. As soon as a disk > goes offline, the read-repair code is unable to read all the surviving > data correctly, and the filesystem has to be kept inactive or data on > the disks will be gradually corrupted as bad parity gets mixed with data > and written back to the filesystem. > > A few of the problems will require a deeper redesign, but IMHO they're not > important problems. e.g. scrub can't identify which drive is corrupted > in all cases, because it has no csum on parity blocks. The current > on-disk format needs every data block in the raid5 stripe to be occupied > by a file with a csum so scrub can eliminate every other block as the > possible source of mismatched parity. While this could be fixed by > a future new raid5 profile (and/or csum tree) specifically designed > to avoid this, it's not something I'd insist on having before deploying > a fleet of btrfs raid5 boxes. Silent corruption failures are so > rare on spinning disks that I'd use the feature maybe once a decade. > Silent corruption due to a failing or overheating HBA chip will most > likely affect multiple disks at once and trash the whole filesystem, > so individual drive-level corruption reporting isn't helpful. > >> Thanks, >> Qu >> >>> >>>> b. Mostly fix the write hole, also without changing the format, by >>>> only doing COW with full stripe writes. Yes you could somehow get >>>> corrupt parity still and not know it until degraded operation produces >>>> a bad reconstruction of data - but checksum will still catch that. >>>> This kind of "unreplicated corruption" is not quite the same thing as >>>> the write hole, because it isn't pernicious like the write hole. >>> >>> What is the difference to a)? Is write hole the worst issue? Judging from the #brtfs channel discussions there seems to be other quite severe issues, for example real data corruption risks in degraded mode. >>> >>>> c. A new de-clustered parity raid56 implementation that is not >>>> backwards compatible. >>> >>> Yes. We have a good opportunity to work out something much better than current implementations. We could have redundant-n profiles that also works with tired storage like ssd/nvme similar to the metadata on ssd idea. >>> >>> Variable stripe width has been brought up before, but received cool responses. Why is that? IMO it could improve random 4k ios by doing equivalent to RAID1 instead of RMW, while also closing the write hole. Perhaps there is a middle ground to be found? >>> >>> >>>> >>>> Ergo, I think it's best to not break the format twice. Even if a new >>>> raid implementation is years off. >>> >>> I very agree here. Btrfs already suffers in public opinion from the lack of a stable and safe-for-data RAID56, and requiring several non-compatible chances isn't going to help. 
>>> >>> I also think it's important that the 'temporary' changes actually leads to a stable filesystem. Because what is the point otherwise? >>> >>> Thanks >>> Forza >>> >>>> >>>> Metadata centric workloads suck on parity raid anyway. If Btrfs always >>>> does full stripe COW won't matter even if the performance is worse >>>> because no one should use parity raid for this workload anyway. >>>> >>>> >>>> -- >>>> Chris Murphy >>> >>>
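To make the scrub behaviour Qu describes above concrete (verify the data blocks of a vertical stripe by checksum, recompute the parity by XOR, and rewrite it if it does not match what is on disk), here is a minimal userspace sketch. The structures and names are illustrative only, not the actual btrfs scrub code.

/* Minimal userspace model of a RAID5 vertical-stripe scrub check:
 * if the data blocks' checksums verify, recompute parity from the data
 * and rewrite it when it does not match the on-disk copy.
 * Illustration only -- names and layout are hypothetical.
 */
#include <stdbool.h>
#include <stdint.h>
#include <string.h>
#include <stdio.h>

#define NR_DATA   2          /* e.g. 3-device RAID5: 2 data + 1 parity */
#define BLOCKSIZE 4096

struct vertical_stripe {
	uint8_t data[NR_DATA][BLOCKSIZE];
	uint8_t parity[BLOCKSIZE];
};

/* stand-in for the real per-block checksum verification */
static bool data_csum_ok(const uint8_t *block) { (void)block; return true; }

static void xor_parity(const struct vertical_stripe *vs, uint8_t *out)
{
	memset(out, 0, BLOCKSIZE);
	for (int d = 0; d < NR_DATA; d++)
		for (int i = 0; i < BLOCKSIZE; i++)
			out[i] ^= vs->data[d][i];
}

/* Returns true if the parity had to be rewritten. */
static bool scrub_vertical_stripe(struct vertical_stripe *vs)
{
	uint8_t expected[BLOCKSIZE];

	for (int d = 0; d < NR_DATA; d++)
		if (!data_csum_ok(vs->data[d]))
			return false;   /* data repair path, not shown here */

	xor_parity(vs, expected);
	if (memcmp(expected, vs->parity, BLOCKSIZE) != 0) {
		memcpy(vs->parity, expected, BLOCKSIZE);  /* rewrite parity */
		return true;
	}
	return false;
}

int main(void)
{
	struct vertical_stripe vs = {0};

	vs.data[0][0] = 0xaa;
	vs.data[1][0] = 0x55;
	vs.parity[0]  = 0x00;   /* stale parity, e.g. after a power loss */
	printf("parity rewritten: %d\n", scrub_vertical_stripe(&vs));
	return 0;
}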
On Mon, Jul 25, 2022 at 08:25:44AM +0800, Qu Wenruo wrote: > > > On 2022/7/25 08:00, Zygo Blaxell wrote: > > On Tue, Jul 19, 2022 at 09:19:21AM +0800, Qu Wenruo wrote: > > > > > > Doing so we don't need any disk format change and it would be backward compatible. > > > > > > > > Do we need to implement RAID56 in the traditional sense? As the > > > user/sysadmin I care about redundancy and performance and cost. The > > > option to create redundancy for any 'n drives is appealing from a cost > > > perspective, otherwise I'd use RAID1/10. > > > > > > Have you heard any recent problems related to dm-raid56? > > > > > > If your answer is no, then I guess we already have an answer to your > > > question. > > > > With plain dm-raid56 the problems were there since the beginning, so > > they're not recent. > > Are you talking about mdraid? They go internal write-intent bitmap and > PPL by default. resync is the default for mdadm raid5, not PPL. Write-intent and PPL are mutually exclusive options. mdadm raid5 doesn't default to bitmap either. (Verified with mdadm v4.2 - 2021-12-30). > > If there's a way to configure PPL or a journal > > device with raid5 LVs on LVM, I can't find it. > > LVM is another story. > > > AFAIK nobody who knows > > what they're doing would choose dm-raid56 for high-value data, especially > > when alternatives like ZFS exist. > > Isn't it the opposite? mdraid is what most people go, other than LVM raid. You said dm-raid, so I thought we were talking about dm-raid here. It's a different interface to the core mdadm raid code, so the practical differences between dm-raid and md-raid for most users are in what lvm exposes (or does not expose). > > Before btrfs, we had a single-digit-percentage rate of severe data losses > > (more than 90% data lost) on filesystems and databases using mdadm + > > ext3/4 with no journal in degraded mode. Multiply by per-drive AFR > > and that's a lot of full system rebuilds over the years. > > > > > > Since the current RAID56 mode have several important drawbacks > > > > > > Let me to be clear: > > > > > > If you can ensure you didn't hit power loss, or after a power loss do a > > > scrub immediately before any new write, then current RAID56 is fine, at > > > least not obviously worse than dm-raid56. > > > > I'm told that scrub doesn't repair parity errors on btrfs. > > That's totally untrue. > > You can easily verify that using "btrfs check --check-data-csum", as > recent btrfs-progs has the extra code to verify the rebuilt data using > parity. > > In fact, I'm testing my write-intent bitmaps code with manually > corrupted parity to emulate a power loss after write-intent bitmaps update. > > And I must say, the scrub code works as expected. That's good, but if it's true, it's a (welcome) change since last week. Every time I've run a raid5 repair test with a single corrupted disk, there has been some lost data, both from scrub and reads. 5.18.12 today behaves the way I'm used to, with read repair unable to repair csum errors and scrub leaving a few uncorrected blocks behind. > The myth may come from some bad advice on only scrubbing a single device > for RAID56 to avoid duplicated IO. > > But the truth is, if only scrubbing one single device, for data stripes > on that device, if no csum error detected, scrub won't check the parity > or the other data stripes in the same vertical stripe. 
> > On the other hand, if scrub is checking the parity stripe, it will also > check the csum for the data stripes in the same vertical stripe, and > rewrite the parity if needed. > > > That was a > > thing I got wrong in my raid5 bug list from 2020. Scrub will fix data > > blocks if they have csum errors, but it will not detect or correct > > corruption in the parity blocks themselves. > > That's exactly what I mentioned, the user is trying to be a smartass > without knowing the details. > > Although I think we should enhance the man page to discourage the usage > of single device scrub. If we have something better to replace it now, sure. The reason for running the scrub on devices sequentially was because it behaved so terribly when the per-device threads ran in parallel. If scrub is now behaving differently on raid56 then the man page should be updated to reflect that. > By default, we scrub all devices (using mount point). The scrub userspace code enumerates the devices and runs a separate thread to scrub each one. Running them on one device at a time makes those threads run sequentially instead of in parallel, and avoids a lot of bad stuff with competing disk accesses and race conditions. See below for a recent example. > > AFAICT the only way to > > get the parity blocks rewritten is to run something like balance, > > which carries risks of its own due to the sheer volume of IO from > > data and metadata updates. > > Completely incorrect. And yet consistent with testing evidence going back 6 years so far. If scrub works, it should be possible to corrupt one drive, scrub, then corrupt the other drive, scrub again, and have zero errors and zero kernel crashes. Instead: # mkfs.btrfs -draid5 -mraid1 -f /dev/vdb /dev/vdc # mount -ospace_cache=v2,compress=zstd /dev/vdb /testfs # cp -a /testdata/. /testfs/. & # 40TB of files, average size 23K [...wait a few minutes for some data, we don't need the whole thing...] # compsize /testfs/. Processed 15271 files, 7901 regular extents (7909 refs), 6510 inline. Type Perc Disk Usage Uncompressed Referenced TOTAL 73% 346M 472M 473M none 100% 253M 253M 253M zstd 42% 92M 219M 219M # cat /dev/zero > /dev/vdb # sync # btrfs scrub start /dev/vdb # or '/testfs', doesn't matter # cat /dev/zero > /dev/vdc # sync # btrfs scrub start /dev/vdc # or '/testfs', doesn't matter ERROR: there are uncorrectable errors # btrfs scrub status -d . UUID: 8237e122-35af-40ef-80bc-101693e878e3 Scrub device /dev/vdb (id 1) no stats available Scrub device /dev/vdc (id 2) history Scrub started: Mon Jul 25 00:02:25 2022 Status: finished Duration: 0:00:22 Total to scrub: 2.01GiB Rate: 1.54MiB/s Error summary: csum=1690 Corrected: 1032 Uncorrectable: 658 Unverified: 0 # cat /proc/version Linux version 5.19.0-ba37a9d53d71-for-next+ (zblaxell@tester) (gcc (Debian 11.3.0-3) 11.3.0, GNU ld (GNU Binutils for Debian) 2.38) #82 SMP PREEMPT_DYNAMIC Sun Jul 24 15:12:57 EDT 2022 Running scrub threads in parallel sometimes triggers stuff like this, which killed one of the test runs while I was writing this: [ 1304.696921] BTRFS info (device vdb): read error corrected: ino 411 off 135168 (dev /dev/vdb sector 3128840) [ 1304.697705] BTRFS info (device vdb): read error corrected: ino 411 off 139264 (dev /dev/vdb sector 3128848) [ 1304.701196] ================================================================== [ 1304.716463] ------------[ cut here ]------------ [ 1304.717094] BUG: KFENCE: use-after-free read in free_io_failure+0x157/0x210 [ 1304.723346] kernel BUG at fs/btrfs/extent_io.c:2350! 
[ 1304.725076] Use-after-free read at 0x000000001e0043a6 (in kfence-#228): [ 1304.725103] free_io_failure+0x157/0x210 [ 1304.725115] clean_io_failure+0x11d/0x260 [ 1304.725126] end_compressed_bio_read+0x2a9/0x470 [ 1304.727698] invalid opcode: 0000 [#1] PREEMPT SMP PTI [ 1304.729516] bio_endio+0x361/0x3c0 [ 1304.731048] CPU: 1 PID: 12615 Comm: kworker/u8:10 Not tainted 5.19.0-ba37a9d53d71-for-next+ #82 d82f965b2e84525cfbba07129899b46c497cda69 [ 1304.733084] rbio_orig_end_io+0x127/0x1c0 [ 1304.736876] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014 [ 1304.738720] __raid_recover_end_io+0x405/0x8f0 [ 1304.740310] Workqueue: btrfs-endio btrfs_end_bio_work [ 1304.748199] raid_recover_end_io_work+0x8c/0xb0 [ 1304.750028] RIP: 0010:repair_io_failure+0x359/0x4b0 [ 1304.752434] process_one_work+0x4e5/0xaa0 [ 1304.752449] worker_thread+0x32e/0x720 [ 1304.754214] Code: 2b e8 2b 2f 79 ff 48 c7 c6 70 06 ac 91 48 c7 c7 00 b9 14 94 e8 38 00 73 ff 48 8d bd 48 ff ff ff e8 8c 7a 26 00 e9 f6 fd ff ff <0f> 0b e8 10 be 5e 01 85 c0 74 cc 48 c7 c7 f0 1c 45 94 e8 30 ab 98 [ 1304.756561] kthread+0x1ab/0x1e0 [ 1304.758398] RSP: 0018:ffffa429c6adbb10 EFLAGS: 00010246 [ 1304.759278] ret_from_fork+0x22/0x30 [ 1304.761343] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000 [ 1304.762308] kfence-#228: 0x00000000cc0e17b4-0x0000000004ce48de, size=48, cache=kmalloc-64 [ 1304.763692] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 [ 1304.764649] allocated by task 12615 on cpu 1 at 1304.670070s: [ 1304.765617] RBP: ffffa429c6adbc08 R08: 0000000000000000 R09: 0000000000000000 [ 1304.766421] btrfs_repair_one_sector+0x370/0x500 [ 1304.767638] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9108baaec000 [ 1304.768341] end_compressed_bio_read+0x187/0x470 [ 1304.770163] R13: 0000000000000000 R14: ffffe44885d55040 R15: ffff9108114e66a4 [ 1304.770993] bio_endio+0x361/0x3c0 [ 1304.772226] FS: 0000000000000000(0000) GS:ffff9109b7200000(0000) knlGS:0000000000000000 [ 1304.773128] btrfs_end_bio_work+0x1f/0x30 [ 1304.773914] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1304.774856] process_one_work+0x4e5/0xaa0 [ 1304.774869] worker_thread+0x32e/0x720 [ 1304.775172] CR2: 00007fb7a88c1738 CR3: 00000000bc03e002 CR4: 0000000000170ee0 [ 1304.776397] kthread+0x1ab/0x1e0 [ 1304.776429] ret_from_fork+0x22/0x30 [ 1304.778282] Call Trace: [ 1304.779009] freed by task 21948 on cpu 2 at 1304.694620s: [ 1304.781760] <TASK> [ 1304.782419] free_io_failure+0x19a/0x210 [ 1304.783213] ? __bio_clone+0x1c0/0x1c0 [ 1304.783952] clean_io_failure+0x11d/0x260 [ 1304.783963] end_compressed_bio_read+0x2a9/0x470 [ 1304.784263] clean_io_failure+0x21a/0x260 [ 1304.785674] bio_endio+0x361/0x3c0 [ 1304.785995] end_compressed_bio_read+0x2a9/0x470 [ 1304.787645] btrfs_end_bio_work+0x1f/0x30 [ 1304.788597] bio_endio+0x361/0x3c0 [ 1304.789674] process_one_work+0x4e5/0xaa0 [ 1304.790786] btrfs_end_bio_work+0x1f/0x30 [ 1304.791776] worker_thread+0x32e/0x720 [ 1304.791788] kthread+0x1ab/0x1e0 [ 1304.792895] process_one_work+0x4e5/0xaa0 [ 1304.793882] ret_from_fork+0x22/0x30 [ 1304.795043] worker_thread+0x32e/0x720 [ 1304.795802] CPU: 3 PID: 12616 Comm: kworker/u8:11 Not tainted 5.19.0-ba37a9d53d71-for-next+ #82 d82f965b2e84525cfbba07129899b46c497cda69 [ 1304.796945] ? _raw_spin_unlock_irqrestore+0x7d/0xa0 [ 1304.797662] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014 [ 1304.798453] ? 
process_one_work+0xaa0/0xaa0 [ 1304.799175] Workqueue: btrfs-endio-raid56 raid_recover_end_io_work [ 1304.799739] kthread+0x1ab/0x1e0 [ 1304.801288] ================================================================== [ 1304.801873] ? kthread_complete_and_exit+0x40/0x40 [ 1304.809362] ================================================================== [ 1304.809933] ret_from_fork+0x22/0x30 [ 1304.809977] </TASK> [ 1304.809982] Modules linked in: [ 1304.810068] ---[ end trace 0000000000000000 ]--- [ 1304.810079] RIP: 0010:repair_io_failure+0x359/0x4b0 [ 1304.810092] Code: 2b e8 2b 2f 79 ff 48 c7 c6 70 06 ac 91 48 c7 c7 00 b9 14 94 e8 38 00 73 ff 48 8d bd 48 ff ff ff e8 8c 7a 26 00 e9 f6 fd ff ff <0f> 0b e8 10 be 5e 01 85 c0 74 cc 48 c7 c7 f0 1c 45 94 e8 30 ab 98 [ 1304.810114] RSP: 0018:ffffa429c6adbb10 EFLAGS: 00010246 [ 1304.810125] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000 [ 1304.810133] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 [ 1304.810140] RBP: ffffa429c6adbc08 R08: 0000000000000000 R09: 0000000000000000 [ 1304.810149] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9108baaec000 [ 1304.810157] R13: 0000000000000000 R14: ffffe44885d55040 R15: ffff9108114e66a4 [ 1304.810165] FS: 0000000000000000(0000) GS:ffff9109b7200000(0000) knlGS:0000000000000000 [ 1304.810175] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1304.810184] CR2: 00007fb7a88c1738 CR3: 00000000bc03e002 CR4: 0000000000170ee0 [ 1304.903432] BUG: KFENCE: invalid free in free_io_failure+0x19a/0x210 [ 1304.906815] Invalid free of 0x00000000cc0e17b4 (in kfence-#228): [ 1304.909006] free_io_failure+0x19a/0x210 [ 1304.909666] clean_io_failure+0x11d/0x260 [ 1304.910358] end_compressed_bio_read+0x2a9/0x470 [ 1304.911121] bio_endio+0x361/0x3c0 [ 1304.911722] rbio_orig_end_io+0x127/0x1c0 [ 1304.912405] __raid_recover_end_io+0x405/0x8f0 [ 1304.919917] raid_recover_end_io_work+0x8c/0xb0 [ 1304.927494] process_one_work+0x4e5/0xaa0 [ 1304.934191] worker_thread+0x32e/0x720 [ 1304.940524] kthread+0x1ab/0x1e0 [ 1304.945963] ret_from_fork+0x22/0x30 [ 1304.953057] kfence-#228: 0x00000000cc0e17b4-0x0000000004ce48de, size=48, cache=kmalloc-64 [ 1304.955733] allocated by task 12615 on cpu 1 at 1304.670070s: [ 1304.957225] btrfs_repair_one_sector+0x370/0x500 [ 1304.958574] end_compressed_bio_read+0x187/0x470 [ 1304.959937] bio_endio+0x361/0x3c0 [ 1304.960960] btrfs_end_bio_work+0x1f/0x30 [ 1304.962193] process_one_work+0x4e5/0xaa0 [ 1304.963403] worker_thread+0x32e/0x720 [ 1304.965498] kthread+0x1ab/0x1e0 [ 1304.966515] ret_from_fork+0x22/0x30 [ 1304.968681] freed by task 21948 on cpu 2 at 1304.694620s: [ 1304.970160] free_io_failure+0x19a/0x210 [ 1304.971725] clean_io_failure+0x11d/0x260 [ 1304.973082] end_compressed_bio_read+0x2a9/0x470 [ 1304.974277] bio_endio+0x361/0x3c0 [ 1304.975245] btrfs_end_bio_work+0x1f/0x30 [ 1304.976623] process_one_work+0x4e5/0xaa0 [ 1304.979141] worker_thread+0x32e/0x720 [ 1304.980044] kthread+0x1ab/0x1e0 [ 1304.981002] ret_from_fork+0x22/0x30 [ 1304.982520] CPU: 2 PID: 12616 Comm: kworker/u8:11 Tainted: G B D 5.19.0-ba37a9d53d71-for-next+ #82 d82f965b2e84525cfbba07129899b46c497cda69 [ 1304.986522] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014 [ 1304.988636] Workqueue: btrfs-endio-raid56 raid_recover_end_io_work [ 1304.990234] ================================================================== On kernels without KASAN or page poisoning, that use-after-free might lead to a hang at the end of a btrfs replace. 
I don't know exactly what's going on there--there is often a hang at the end of a raid5 replace, it's caused by a mismatch between the count of active bios and the actual number of active bios, and a use-after-free might be causing that by forgetting to decrement the counter. There are multiple overlapping bugs in btrfs raid5 and it's hard to reliably separate them until some of them get fixed. Another data point: I ran 5 test runs while writing this, and the third one did fix all the errors in scrub. It sometimes does happen over test cases of a few gigabytes. It's just not anywhere near reliable enough to fix a 50TB array with one busted disk. I think you need better test cases. btrfs raid5 has been broken like this since the beginning, with failures that can be demonstrated in minutes. btrfs raid1 can run these tests all day. > > Most of the raid56 bugs I've identified have nothing to do with power > > loss. The data on disks is fine, but the kernel can't read it correctly > > in degraded mode, or the diagnostic data from scrub are clearly garbage. > > Unable to read in degraded mode just means parity is out-of-sync with data. No, the degraded mode case is different. It has distinct behavior from the above test case where all the drives are online but csums are failing. In degraded mode one of the devices is unavailable, so the read code is trying to reconstruct data on the fly. The parity and data on disk is often OK on the surviving disks if I dump it out by hand, and often all the data can be recovered by 'btrfs replace' without error (as long as 'btrfs replace' is the only active process on the filesystem). Rebooting the test VM will make a different set of data unreadable through the filesystem, and the set of unreadable blocks changes over time if running something like: sysctl vm.drop_caches=3; find -type f -exec cat {} + >/dev/null in a loop, especially if something is writing to the filesystem at the same time. Note there is never a write hole in these test cases--the filesystem is always cleanly umounted, and sometimes there's no umount at all, one device is simply disconnected with no umount or reboot. > There are several other bugs related to this, mostly related to the > cached raid bio and how we rebuild the data. (aka, btrfs/125) > Thankfully I have submitted patches for that bug and now btrfs/125 > should pass without problems. xfstests btrfs/125 is an extremely simple test case. I'm using btrfs raid5 on 20-80TB filesystems, millions to billions of files. The error rate is quantitatively low (only 0.01% of data is lost after one disk failure) but it should be zero, as none of my test cases involve write hole, nodatacow, or raid5 metadata. for-next and misc-next are still quite broken, though to be fair they definitely have issues beyond raid5. 5.18.12 can get through the test without tripping over KASAN or blowing up the metadata, but it has uncorrectable errors and fake read errors: # btrfs scrub start -Bd /testfs/ Scrub device /dev/vdb (id 1) done Scrub started: Mon Jul 25 00:49:28 2022 Status: finished Duration: 0:03:03 Total to scrub: 4.01GiB Rate: 1.63MiB/s Error summary: read=3 csum=7578 Corrected: 7577 Uncorrectable: 4 Unverified: 1 I know the read errors are fake because /dev/vdb is a file on a tmpfs. > But the powerloss can still lead to out-of-sync parity and that's why > I'm fixing the problem using write-intent-bitmaps. None of my test cases involve write hole, as I know write-hole test cases will always fail. 
There's no point in testing write hole if recovery from much simpler failures isn't working yet. > > I noticed you and others have done some work here recently, so some of > > these issues might be fixed in 5.19. I haven't re-run my raid5 tests > > on post-5.18 kernels yet (there have been other bugs blocking testing). > > > > > (There are still common problems shared between both btrfs raid56 and > > > dm-raid56, like destructive-RMW) > > > > Yeah, that's one of the critical things to fix because btrfs is in a good > > position to do as well or better than dm-raid56. btrfs has definitely > > fallen behind the other available solutions in the 9 years since raid5 was > > first added to btrfs, as btrfs implements only the basic configuration > > of raid56 (no parity integrity or rmw journal) that is fully vulnerable > > to write hole and drive-side data corruption. > > > > > > - and that it's officially not recommended for production use - it > > > is a good idea to reconstruct new btrfs 'redundant-n' profiles that > > > doesn't have the inherent issues of traditional RAID. > > > > > > I'd say the complexity is hugely underestimated. > > > > I'd agree with that. e.g. some btrfs equivalent of ZFS raidZ (put parity > > blocks inline with extents during writes) is not much more complex to > > implement on btrfs than compression; however, the btrfs kernel code > > couldn't read compressed data correctly for 12 years out of its 14-year > > history, and nobody wants to wait another decade or more for raid5 > > to work. > > > > It seems to me the biggest problem with write hole fixes is that all > > the potential fixes have cost tradeoffs, and everybody wants to veto > > the fix that has a cost they don't like. > > Well, that's why I prefer multiple solutions for end users to choose, > other than really trying to get a silver bullet solution. > > (That's also why I'm recently trying to separate block group tree from > extent tree v2, as I really believe progressive improvement over a death > ball feature) Yeah I'm definitely in favor of getting bgtree done sooner rather than later. It's a simple, stand-alone feature that has well known beneficial effect. If the extent tree v2 project wants to do something incompatible with it later on, that's extent tree v2's problem, not a reason to block bgtree in the short term. > Thanks, > Qu > > > > > We could implement multiple fix approaches at the same time, as AFAIK > > most of the proposed solutions are orthogonal to each other. e.g. a > > write-ahead log can safely enable RMW at a higher IO cost, while the > > allocator could place extents to avoid RMW and thereby avoid the logging > > cost as much as possible (paid for by a deferred relocation/garbage > > collection cost), and using both at the same time would combine both > > benefits. Both solutions can be used independently for filesystems at > > extreme ends of the performance/capacity spectrum (if the filesystem is > > never more than 50% full, then logging is all cost with no gain compared > > to allocator avoidance of RMW, while a filesystem that is always near > > full will have to journal writes and also throttle writes on the journal. > > > > > > For example a non-striped redundant-n profile as well as a striped redundant-n profile. > > > > > > Non-striped redundant-n profile is already so complex that I can't > > > figure out a working idea right now. > > > > > > But if there is such way, I'm pretty happy to consider. > > > > > > > > > > > > > > > > > My 2 cents... 
> > > > > > > > > > Regarding the current raid56 support, in order of preference: > > > > > > > > > > a. Fix the current bugs, without changing format. Zygo has an extensive list. > > > > > > > > I agree that relatively simple fixes should be made. But it seems we will need quite a large rewrite to solve all issues? Is there a minium viable option here? > > > > > > Nope. Just see my write-intent code, already have prototype (just needs > > > new scrub based recovery code at mount time) working. > > > > > > And based on my write-intent code, I don't think it's that hard to > > > implement a full journal. > > > > FWIW I think we can get a very usable btrfs raid5 with a small format > > change (add a journal for stripe RMW, though we might disagree about > > details of how it should be structured and used) and fixes to the > > read-repair and scrub problems. The read-side problems in btrfs raid5 > > were always much more severe than the write hole. As soon as a disk > > goes offline, the read-repair code is unable to read all the surviving > > data correctly, and the filesystem has to be kept inactive or data on > > the disks will be gradually corrupted as bad parity gets mixed with data > > and written back to the filesystem. > > > > A few of the problems will require a deeper redesign, but IMHO they're not > > important problems. e.g. scrub can't identify which drive is corrupted > > in all cases, because it has no csum on parity blocks. The current > > on-disk format needs every data block in the raid5 stripe to be occupied > > by a file with a csum so scrub can eliminate every other block as the > > possible source of mismatched parity. While this could be fixed by > > a future new raid5 profile (and/or csum tree) specifically designed > > to avoid this, it's not something I'd insist on having before deploying > > a fleet of btrfs raid5 boxes. Silent corruption failures are so > > rare on spinning disks that I'd use the feature maybe once a decade. > > Silent corruption due to a failing or overheating HBA chip will most > > likely affect multiple disks at once and trash the whole filesystem, > > so individual drive-level corruption reporting isn't helpful. > > > > > Thanks, > > > Qu > > > > > > > > > > > > b. Mostly fix the write hole, also without changing the format, by > > > > > only doing COW with full stripe writes. Yes you could somehow get > > > > > corrupt parity still and not know it until degraded operation produces > > > > > a bad reconstruction of data - but checksum will still catch that. > > > > > This kind of "unreplicated corruption" is not quite the same thing as > > > > > the write hole, because it isn't pernicious like the write hole. > > > > > > > > What is the difference to a)? Is write hole the worst issue? Judging from the #brtfs channel discussions there seems to be other quite severe issues, for example real data corruption risks in degraded mode. > > > > > > > > > c. A new de-clustered parity raid56 implementation that is not > > > > > backwards compatible. > > > > > > > > Yes. We have a good opportunity to work out something much better than current implementations. We could have redundant-n profiles that also works with tired storage like ssd/nvme similar to the metadata on ssd idea. > > > > > > > > Variable stripe width has been brought up before, but received cool responses. Why is that? IMO it could improve random 4k ios by doing equivalent to RAID1 instead of RMW, while also closing the write hole. Perhaps there is a middle ground to be found? 
> > > > > > > > > > > > > > > > > > Ergo, I think it's best to not break the format twice. Even if a new > > > > > raid implementation is years off. > > > > > > > > I very agree here. Btrfs already suffers in public opinion from the lack of a stable and safe-for-data RAID56, and requiring several non-compatible chances isn't going to help. > > > > > > > > I also think it's important that the 'temporary' changes actually leads to a stable filesystem. Because what is the point otherwise? > > > > > > > > Thanks > > > > Forza > > > > > > > > > > > > > > Metadata centric workloads suck on parity raid anyway. If Btrfs always > > > > > does full stripe COW won't matter even if the performance is worse > > > > > because no one should use parity raid for this workload anyway. > > > > > > > > > > > > > > > -- > > > > > Chris Murphy > > > > > > > > >
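For readers following the back-and-forth, here is a tiny worked example of the write hole that the journal and write-intent discussion keeps circling around. It uses assumed single-byte "blocks" and is purely illustrative, not btrfs code: an RMW interrupted between the data write and the parity write silently corrupts a block that was never written, once the array later runs degraded.

/* Write-hole demonstration: update D0 with RMW but crash before the
 * matching parity write reaches disk, then lose the device holding D1
 * and try to reconstruct D1 from D0 and the stale parity.
 */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint8_t d0 = 0x11, d1 = 0x22;
	uint8_t p  = d0 ^ d1;            /* consistent stripe: P = D0 ^ D1 */

	uint8_t d0_new = 0x33;
	/* RMW would write d0_new and (p ^ d0 ^ d0_new); the crash happens
	 * after the data write but before the parity write: */
	d0 = d0_new;                     /* new data is on disk            */
	/* p still holds the old value      on disk                        */

	uint8_t d1_rebuilt = d0 ^ p;     /* degraded read of D1            */
	printf("D1 was 0x%02x, rebuilt as 0x%02x -> %s\n",
	       d1, d1_rebuilt,
	       d1 == d1_rebuilt ? "ok" : "corrupted by the write hole");
	return 0;
}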
On 2022/7/25 13:41, Zygo Blaxell wrote:
> On Mon, Jul 25, 2022 at 08:25:44AM +0800, Qu Wenruo wrote: [...] >> >> You can easily verify that using "btrfs check --check-data-csum", as >> recent btrfs-progs has the extra code to verify the rebuilt data using >> parity. >> >> In fact, I'm testing my write-intent bitmaps code with manually >> corrupted parity to emulate a power loss after write-intent bitmaps update. >> >> And I must say, the scrub code works as expected. > > That's good, but if it's true, it's a (welcome) change since last week. > Every time I've run a raid5 repair test with a single corrupted disk, > there has been some lost data, both from scrub and reads. 5.18.12 today > behaves the way I'm used to, with read repair unable to repair csum > errors and scrub leaving a few uncorrected blocks behind.

Have you tried misc-next? The following patches are not yet upstream nor backported:

btrfs: raid56: don't trust any cached sector in __raid56_parity_recover()
btrfs: update stripe_sectors::uptodate in steal_rbio
btrfs: only write the sectors in the vertical stripe which has data stripes

> >> The myth may come from some bad advice on only scrubbing a single device >> for RAID56 to avoid duplicated IO. >> >> But the truth is, if only scrubbing one single device, for data stripes >> on that device, if no csum error detected, scrub won't check the parity >> or the other data stripes in the same vertical stripe. >> >> On the other hand, if scrub is checking the parity stripe, it will also >> check the csum for the data stripes in the same vertical stripe, and >> rewrite the parity if needed. >> >>> That was a >>> thing I got wrong in my raid5 bug list from 2020. Scrub will fix data >>> blocks if they have csum errors, but it will not detect or correct >>> corruption in the parity blocks themselves. >> >> That's exactly what I mentioned, the user is trying to be a smartass >> without knowing the details. >> >> Although I think we should enhance the man page to discourage the usage >> of single device scrub. > > If we have something better to replace it now, sure. The reason for > running the scrub on devices sequentially was because it behaved so > terribly when the per-device threads ran in parallel.

Really? For mirror/stripe based profiles they should be fine, since each device's scrub only does IO from that device (if no rebuild is needed). Although things like extent and csum tree iteration would cause some conflicts, I don't think that would be a big problem, as tree block caching should work pretty well.

It's RAID56 where the per-device scrubs race with each other: for parity scrubbing we will do extra IO from the data stripes, and that will cause performance problems. In that respect, we indeed need a better interface for RAID56 scrubbing.

But that's RAID56 only.
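To illustrate why the racing is specific to RAID56, here is a small sketch using a made-up rotating-parity layout (not btrfs's real chunk mapping): data stripes stay local to the scrubbed device, but every parity stripe pulls in reads from all the other devices, so parallel per-device scrubs end up competing for the same disks.

/* Hypothetical 3-device RAID5 with parity rotating per stripe: print
 * which devices each per-device scrub must read. Data stripes need only
 * the scrubbed device; parity stripes need every device in the stripe.
 */
#include <stdio.h>

#define NR_DEVS 3

int main(void)
{
	for (int dev = 0; dev < NR_DEVS; dev++) {
		printf("scrub of dev %d:\n", dev);
		for (int stripe = 0; stripe < NR_DEVS; stripe++) {
			int parity_dev = stripe % NR_DEVS;  /* rotating parity */
			if (parity_dev != dev) {
				/* data stripe: csum check against this device only */
				printf("  stripe %d: read dev %d only\n", stripe, dev);
			} else {
				/* parity stripe: must read the whole vertical stripe */
				printf("  stripe %d: read devs", stripe);
				for (int d = 0; d < NR_DEVS; d++)
					printf(" %d", d);
				printf(" (parity check)\n");
			}
		}
	}
	return 0;
}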
> > And yet consistent with testing evidence going back 6 years so far. > > If scrub works, it should be possible to corrupt one drive, scrub, > then corrupt the other drive, scrub again, and have zero errors > and zero kernel crashes. Instead: > > # mkfs.btrfs -draid5 -mraid1 -f /dev/vdb /dev/vdc > # mount -ospace_cache=v2,compress=zstd /dev/vdb /testfs > # cp -a /testdata/. /testfs/. & # 40TB of files, average size 23K > > [...wait a few minutes for some data, we don't need the whole thing...] > > # compsize /testfs/. > Processed 15271 files, 7901 regular extents (7909 refs), 6510 inline. > Type Perc Disk Usage Uncompressed Referenced > TOTAL 73% 346M 472M 473M > none 100% 253M 253M 253M > zstd 42% 92M 219M 219M > > # cat /dev/zero > /dev/vdb > # sync > # btrfs scrub start /dev/vdb # or '/testfs', doesn't matter > # cat /dev/zero > /dev/vdc > # sync > > # btrfs scrub start /dev/vdc # or '/testfs', doesn't matter > ERROR: there are uncorrectable errors > # btrfs scrub status -d . > UUID: 8237e122-35af-40ef-80bc-101693e878e3 > > Scrub device /dev/vdb (id 1) > no stats available > > Scrub device /dev/vdc (id 2) history > Scrub started: Mon Jul 25 00:02:25 2022 > Status: finished > Duration: 0:00:22 > Total to scrub: 2.01GiB > Rate: 1.54MiB/s > Error summary: csum=1690 > Corrected: 1032 > Uncorrectable: 658 > Unverified: 0 > # cat /proc/version > Linux version 5.19.0-ba37a9d53d71-for-next+ (zblaxell@tester) (gcc (Debian 11.3.0-3) 11.3.0, GNU ld (GNU Binutils for Debian) 2.38) #82 SMP PREEMPT_DYNAMIC Sun Jul 24 15:12:57 EDT 2022 > > Running scrub threads in parallel sometimes triggers stuff like this, > which killed one of the test runs while I was writing this: > > [ 1304.696921] BTRFS info (device vdb): read error corrected: ino 411 off 135168 (dev /dev/vdb sector 3128840) > [ 1304.697705] BTRFS info (device vdb): read error corrected: ino 411 off 139264 (dev /dev/vdb sector 3128848) > [ 1304.701196] ================================================================== > [ 1304.716463] ------------[ cut here ]------------ > [ 1304.717094] BUG: KFENCE: use-after-free read in free_io_failure+0x157/0x210 > > [ 1304.723346] kernel BUG at fs/btrfs/extent_io.c:2350! > [ 1304.725076] Use-after-free read at 0x000000001e0043a6 (in kfence-#228): > [ 1304.725103] free_io_failure+0x157/0x210 > [ 1304.725115] clean_io_failure+0x11d/0x260 > [ 1304.725126] end_compressed_bio_read+0x2a9/0x470 This looks like a problem relater to read-repair code with compression, HCH is also working on this, in the long run we would get rid of io failure record completely. Have you tried without compression? [...] > > On kernels without KASAN or page poisoning, that use-after-free might lead > to a hang at the end of a btrfs replace. I don't know exactly what's > going on there--there is often a hang at the end of a raid5 replace, > it's caused by a mismatch between the count of active bios and the actual > number of active bios, and a use-after-free might be causing that by > forgetting to decrement the counter. There are multiple overlapping > bugs in btrfs raid5 and it's hard to reliably separate them until some > of them get fixed. > > Another data point: I ran 5 test runs while writing this, and the third > one did fix all the errors in scrub. It sometimes does happen over test > cases of a few gigabytes. It's just not anywhere near reliable enough > to fix a 50TB array with one busted disk. > > I think you need better test cases. 
> btrfs raid5 has been broken like this > since the beginning, with failures that can be demonstrated in minutes. > btrfs raid1 can run these tests all day.

I'd say the compression is adding a completely different complexity to the equation.

Yep, from a sysadmin's point of view this is completely fine, but for us to locate the problems I'd prefer something without compression, just RAID56 and plain file operations, to see if it's really compression or something else that screwed up.

Remember, for read-time repair btrfs is still just trying the next mirror, no different than RAID1. It's the RAID56 recovery code that implements the 2nd mirror, using the extra P/Q and data stripes to rebuild the data.

We had bugs which can cause exactly the same unrepairable problems in the RAID56 repair path (not the read-repair path). In that case, please try misc-next to see if there is any improvement (better without compression).

> >>> Most of the raid56 bugs I've identified have nothing to do with power >>> loss. The data on disks is fine, but the kernel can't read it correctly >>> in degraded mode, or the diagnostic data from scrub are clearly garbage. >> >> Unable to read in degraded mode just means parity is out-of-sync with data. > > No, the degraded mode case is different. It has distinct behavior from > the above test case where all the drives are online but csums are failing. > In degraded mode one of the devices is unavailable, so the read code > is trying to reconstruct data on the fly. The parity and data on disk > is often OK on the surviving disks if I dump it out by hand, and often > all the data can be recovered by 'btrfs replace' without error (as > long as 'btrfs replace' is the only active process on the filesystem). > > Rebooting the test VM will make a different set of data unreadable > through the filesystem, and the set of unreadable blocks changes over > time if running something like: > > sysctl vm.drop_caches=3; find -type f -exec cat {} + >/dev/null > > in a loop, especially if something is writing to the filesystem at the > same time. Note there is never a write hole in these test cases--the > filesystem is always cleanly umounted, and sometimes there's no umount > at all, one device is simply disconnected with no umount or reboot. > >> There are several other bugs related to this, mostly related to the >> cached raid bio and how we rebuild the data. (aka, btrfs/125) >> Thankfully I have submitted patches for that bug and now btrfs/125 >> should pass without problems. > > xfstests btrfs/125 is an extremely simple test case. I'm using btrfs > raid5 on 20-80TB filesystems, millions to billions of files.

That's the difference between developers and end users, and I totally understand that you want to go heavy on testing and squeeze out every bug. But from a developer's view, we prefer to fix bugs one by one.

Btrfs/125 is a very short but effective test case to show how the original RAID56 repair path has problems, mostly related to its cached raid56 behavior. I'm not saying the test case represents all the problems, but it's a very quick indicator of whether your code base has the fixes for the RAID56 code.

> The error > rate is quantitatively low (only 0.01% of data is lost after one disk > failure) but it should be zero, as none of my test cases involve write > hole, nodatacow, or raid5 metadata. > > for-next and misc-next are still quite broken, though to be fair they > definitely have issues beyond raid5.

I'm more interested in how misc-next is broken.
I know that previous misc-next has some problems related to page faulting and can hang fsstress easily. But that should be fixed in recent misc-next, thus I strongly recommend to try it (and without compression) to see if there is any improvement for RAID56. Thanks, Qu > 5.18.12 can get through the > test without tripping over KASAN or blowing up the metadata, but it > has uncorrectable errors and fake read errors: > > # btrfs scrub start -Bd /testfs/ > > Scrub device /dev/vdb (id 1) done > Scrub started: Mon Jul 25 00:49:28 2022 > Status: finished > Duration: 0:03:03 > Total to scrub: 4.01GiB > Rate: 1.63MiB/s > Error summary: read=3 csum=7578 > Corrected: 7577 > Uncorrectable: 4 > Unverified: 1 > > I know the read errors are fake because /dev/vdb is a file on a tmpfs. > >> But the powerloss can still lead to out-of-sync parity and that's why >> I'm fixing the problem using write-intent-bitmaps. > > None of my test cases involve write hole, as I know write-hole test cases > will always fail. There's no point in testing write hole if recovery > from much simpler failures isn't working yet. > >>> I noticed you and others have done some work here recently, so some of >>> these issues might be fixed in 5.19. I haven't re-run my raid5 tests >>> on post-5.18 kernels yet (there have been other bugs blocking testing). >>> >>>> (There are still common problems shared between both btrfs raid56 and >>>> dm-raid56, like destructive-RMW) >>> >>> Yeah, that's one of the critical things to fix because btrfs is in a good >>> position to do as well or better than dm-raid56. btrfs has definitely >>> fallen behind the other available solutions in the 9 years since raid5 was >>> first added to btrfs, as btrfs implements only the basic configuration >>> of raid56 (no parity integrity or rmw journal) that is fully vulnerable >>> to write hole and drive-side data corruption. >>> >>>>> - and that it's officially not recommended for production use - it >>>> is a good idea to reconstruct new btrfs 'redundant-n' profiles that >>>> doesn't have the inherent issues of traditional RAID. >>>> >>>> I'd say the complexity is hugely underestimated. >>> >>> I'd agree with that. e.g. some btrfs equivalent of ZFS raidZ (put parity >>> blocks inline with extents during writes) is not much more complex to >>> implement on btrfs than compression; however, the btrfs kernel code >>> couldn't read compressed data correctly for 12 years out of its 14-year >>> history, and nobody wants to wait another decade or more for raid5 >>> to work. >>> >>> It seems to me the biggest problem with write hole fixes is that all >>> the potential fixes have cost tradeoffs, and everybody wants to veto >>> the fix that has a cost they don't like. >> >> Well, that's why I prefer multiple solutions for end users to choose, >> other than really trying to get a silver bullet solution. >> >> (That's also why I'm recently trying to separate block group tree from >> extent tree v2, as I really believe progressive improvement over a death >> ball feature) > > Yeah I'm definitely in favor of getting bgtree done sooner rather > than later. It's a simple, stand-alone feature that has well known > beneficial effect. If the extent tree v2 project wants to do something > incompatible with it later on, that's extent tree v2's problem, not a > reason to block bgtree in the short term. > >> Thanks, >> Qu >> >>> >>> We could implement multiple fix approaches at the same time, as AFAIK >>> most of the proposed solutions are orthogonal to each other. e.g. 
a >>> write-ahead log can safely enable RMW at a higher IO cost, while the >>> allocator could place extents to avoid RMW and thereby avoid the logging >>> cost as much as possible (paid for by a deferred relocation/garbage >>> collection cost), and using both at the same time would combine both >>> benefits. Both solutions can be used independently for filesystems at >>> extreme ends of the performance/capacity spectrum (if the filesystem is >>> never more than 50% full, then logging is all cost with no gain compared >>> to allocator avoidance of RMW, while a filesystem that is always near >>> full will have to journal writes and also throttle writes on the journal. >>> >>>>> For example a non-striped redundant-n profile as well as a striped redundant-n profile. >>>> >>>> Non-striped redundant-n profile is already so complex that I can't >>>> figure out a working idea right now. >>>> >>>> But if there is such way, I'm pretty happy to consider. >>>> >>>>> >>>>>> >>>>>> My 2 cents... >>>>>> >>>>>> Regarding the current raid56 support, in order of preference: >>>>>> >>>>>> a. Fix the current bugs, without changing format. Zygo has an extensive list. >>>>> >>>>> I agree that relatively simple fixes should be made. But it seems we will need quite a large rewrite to solve all issues? Is there a minium viable option here? >>>> >>>> Nope. Just see my write-intent code, already have prototype (just needs >>>> new scrub based recovery code at mount time) working. >>>> >>>> And based on my write-intent code, I don't think it's that hard to >>>> implement a full journal. >>> >>> FWIW I think we can get a very usable btrfs raid5 with a small format >>> change (add a journal for stripe RMW, though we might disagree about >>> details of how it should be structured and used) and fixes to the >>> read-repair and scrub problems. The read-side problems in btrfs raid5 >>> were always much more severe than the write hole. As soon as a disk >>> goes offline, the read-repair code is unable to read all the surviving >>> data correctly, and the filesystem has to be kept inactive or data on >>> the disks will be gradually corrupted as bad parity gets mixed with data >>> and written back to the filesystem. >>> >>> A few of the problems will require a deeper redesign, but IMHO they're not >>> important problems. e.g. scrub can't identify which drive is corrupted >>> in all cases, because it has no csum on parity blocks. The current >>> on-disk format needs every data block in the raid5 stripe to be occupied >>> by a file with a csum so scrub can eliminate every other block as the >>> possible source of mismatched parity. While this could be fixed by >>> a future new raid5 profile (and/or csum tree) specifically designed >>> to avoid this, it's not something I'd insist on having before deploying >>> a fleet of btrfs raid5 boxes. Silent corruption failures are so >>> rare on spinning disks that I'd use the feature maybe once a decade. >>> Silent corruption due to a failing or overheating HBA chip will most >>> likely affect multiple disks at once and trash the whole filesystem, >>> so individual drive-level corruption reporting isn't helpful. >>> >>>> Thanks, >>>> Qu >>>> >>>>> >>>>>> b. Mostly fix the write hole, also without changing the format, by >>>>>> only doing COW with full stripe writes. Yes you could somehow get >>>>>> corrupt parity still and not know it until degraded operation produces >>>>>> a bad reconstruction of data - but checksum will still catch that. 
>>>>>> This kind of "unreplicated corruption" is not quite the same thing as >>>>>> the write hole, because it isn't pernicious like the write hole. >>>>> >>>>> What is the difference to a)? Is write hole the worst issue? Judging from the #brtfs channel discussions there seems to be other quite severe issues, for example real data corruption risks in degraded mode. >>>>> >>>>>> c. A new de-clustered parity raid56 implementation that is not >>>>>> backwards compatible. >>>>> >>>>> Yes. We have a good opportunity to work out something much better than current implementations. We could have redundant-n profiles that also works with tired storage like ssd/nvme similar to the metadata on ssd idea. >>>>> >>>>> Variable stripe width has been brought up before, but received cool responses. Why is that? IMO it could improve random 4k ios by doing equivalent to RAID1 instead of RMW, while also closing the write hole. Perhaps there is a middle ground to be found? >>>>> >>>>> >>>>>> >>>>>> Ergo, I think it's best to not break the format twice. Even if a new >>>>>> raid implementation is years off. >>>>> >>>>> I very agree here. Btrfs already suffers in public opinion from the lack of a stable and safe-for-data RAID56, and requiring several non-compatible chances isn't going to help. >>>>> >>>>> I also think it's important that the 'temporary' changes actually leads to a stable filesystem. Because what is the point otherwise? >>>>> >>>>> Thanks >>>>> Forza >>>>> >>>>>> >>>>>> Metadata centric workloads suck on parity raid anyway. If Btrfs always >>>>>> does full stripe COW won't matter even if the performance is worse >>>>>> because no one should use parity raid for this workload anyway. >>>>>> >>>>>> >>>>>> -- >>>>>> Chris Murphy >>>>> >>>>> >>
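For reference, the "2nd mirror" that Qu mentions for RAID5 read repair amounts to the reconstruction below: a minimal sketch, not btrfs code, in which a missing or csum-failing data block is rebuilt by XORing the parity with the surviving data stripes of the vertical stripe.

/* Rebuild one lost data stripe of a RAID5 vertical stripe from the
 * parity and the surviving data stripes (illustrative sizes only).
 */
#include <stdint.h>
#include <string.h>
#include <stdio.h>

#define NR_DATA   3
#define BLOCKSIZE 16            /* tiny blocks, just for the demo */

static void rebuild_data_stripe(uint8_t data[NR_DATA][BLOCKSIZE],
				const uint8_t parity[BLOCKSIZE],
				int missing)
{
	/* D_missing = P ^ XOR(all other data stripes) */
	memcpy(data[missing], parity, BLOCKSIZE);
	for (int d = 0; d < NR_DATA; d++) {
		if (d == missing)
			continue;
		for (int i = 0; i < BLOCKSIZE; i++)
			data[missing][i] ^= data[d][i];
	}
}

int main(void)
{
	uint8_t data[NR_DATA][BLOCKSIZE], parity[BLOCKSIZE] = {0};

	for (int d = 0; d < NR_DATA; d++)
		for (int i = 0; i < BLOCKSIZE; i++) {
			data[d][i] = (uint8_t)(d * 16 + i);
			parity[i] ^= data[d][i];
		}

	uint8_t lost[BLOCKSIZE];
	memcpy(lost, data[1], BLOCKSIZE);
	memset(data[1], 0, BLOCKSIZE);          /* "device" 1 is gone */

	rebuild_data_stripe(data, parity, 1);
	printf("rebuild %s\n",
	       memcmp(lost, data[1], BLOCKSIZE) == 0 ? "matches" : "differs");
	return 0;
}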
On 25/07/2022 02.00, Zygo Blaxell wrote: > On Tue, Jul 19, 2022 at 09:19:21AM +0800, Qu Wenruo wrote: [...] > > I'd agree with that. e.g. some btrfs equivalent of ZFS raidZ (put parity > blocks inline with extents during writes) is not much more complex to > implement on btrfs than compression; however, the btrfs kernel code > couldn't read compressed data correctly for 12 years out of its 14-year > history, and nobody wants to wait another decade or more for raid5 > to work. > > It seems to me the biggest problem with write hole fixes is that all > the potential fixes have cost tradeoffs, and everybody wants to veto > the fix that has a cost they don't like. > > We could implement multiple fix approaches at the same time, as AFAIK > most of the proposed solutions are orthogonal to each other. e.g. a > write-ahead log can safely enable RMW at a higher IO cost, while the > allocator could place extents to avoid RMW and thereby avoid the logging > cost as much as possible (paid for by a deferred relocation/garbage > collection cost), and using both at the same time would combine both > benefits. Both solutions can be used independently for filesystems at > extreme ends of the performance/capacity spectrum (if the filesystem is > never more than 50% full, then logging is all cost with no gain compared > to allocator avoidance of RMW, while a filesystem that is always near > full will have to journal writes and also throttle writes on the journal. Kudos to Zygo; I have to say that I never encountered before a so clearly explanation of the complexity around btrfs raid5/6 problems and the related solutions. > >>> For example a non-striped redundant-n profile as well as a striped redundant-n profile. >> >> Non-striped redundant-n profile is already so complex that I can't >> figure out a working idea right now. >> >> But if there is such way, I'm pretty happy to consider. >> >>> >>>> >>>> My 2 cents... >>>> >>>> Regarding the current raid56 support, in order of preference: >>>> >>>> a. Fix the current bugs, without changing format. Zygo has an extensive list. >>> >>> I agree that relatively simple fixes should be made. But it seems we will need quite a large rewrite to solve all issues? Is there a minium viable option here? >> >> Nope. Just see my write-intent code, already have prototype (just needs >> new scrub based recovery code at mount time) working. >> >> And based on my write-intent code, I don't think it's that hard to >> implement a full journal. > > FWIW I think we can get a very usable btrfs raid5 with a small format > change (add a journal for stripe RMW, though we might disagree about > details of how it should be structured and used)... Again, I have to agree with Zygo. 
Even though I am fascinated by a solution like ZFS (parity block inside the extent), I think that a journal (and a write-intent log) is a more pragmatic approach:
- this kind of solution sits below the btrfs block groups; this avoids adding further pressure on the metadata
- sitting below the other btrfs structures, it can be shaped more easily, with less risk of incompatibility

It is true that a ZFS-like solution may be faster in some workloads, but I think these are very few:
- for high throughput, you likely write the full stripe, which doesn't need a journal/PPL
- for small block updates, a journal is more efficient than rewriting the full stripe

I hope that the end result of Qu's activities will be a more robust raid5 btrfs implementation, which will in turn increase the number of users, and which will in turn increase the pressure to improve this part of btrfs.

My only suggestion is to evaluate whether we need to develop a write-intent log and then a journal, instead of developing the journal alone. I think that two disk format changes are too much.

BR
G.Baroncelli

> and fixes to the > read-repair and scrub problems. The read-side problems in btrfs raid5 > were always much more severe than the write hole. As soon as a disk > goes offline, the read-repair code is unable to read all the surviving > data correctly, and the filesystem has to be kept inactive or data on > the disks will be gradually corrupted as bad parity gets mixed with data > and written back to the filesystem. > > A few of the problems will require a deeper redesign, but IMHO they're not > important problems. e.g. scrub can't identify which drive is corrupted > in all cases, because it has no csum on parity blocks. The current > on-disk format needs every data block in the raid5 stripe to be occupied > by a file with a csum so scrub can eliminate every other block as the > possible source of mismatched parity. While this could be fixed by > a future new raid5 profile (and/or csum tree) specifically designed > to avoid this, it's not something I'd insist on having before deploying > a fleet of btrfs raid5 boxes. Silent corruption failures are so > rare on spinning disks that I'd use the feature maybe once a decade. > Silent corruption due to a failing or overheating HBA chip will most > likely affect multiple disks at once and trash the whole filesystem, > so individual drive-level corruption reporting isn't helpful. > >> Thanks, >> Qu >> >>> >>>> b. Mostly fix the write hole, also without changing the format, by >>>> only doing COW with full stripe writes. Yes you could somehow get >>>> corrupt parity still and not know it until degraded operation produces >>>> a bad reconstruction of data - but checksum will still catch that. >>>> This kind of "unreplicated corruption" is not quite the same thing as >>>> the write hole, because it isn't pernicious like the write hole. >>> >>> What is the difference to a)? Is write hole the worst issue? Judging from the #brtfs channel discussions there seems to be other quite severe issues, for example real data corruption risks in degraded mode. >>> >>>> c. A new de-clustered parity raid56 implementation that is not >>>> backwards compatible. >>> >>> Yes. We have a good opportunity to work out something much better than current implementations. We could have redundant-n profiles that also works with tired storage like ssd/nvme similar to the metadata on ssd idea. >>> >>> Variable stripe width has been brought up before, but received cool responses. Why is that?
IMO it could improve random 4k ios by doing equivalent to RAID1 instead of RMW, while also closing the write hole. Perhaps there is a middle ground to be found? >>> >>> >>>> >>>> Ergo, I think it's best to not break the format twice. Even if a new >>>> raid implementation is years off. >>> >>> I very agree here. Btrfs already suffers in public opinion from the lack of a stable and safe-for-data RAID56, and requiring several non-compatible chances isn't going to help. >>> >>> I also think it's important that the 'temporary' changes actually leads to a stable filesystem. Because what is the point otherwise? >>> >>> Thanks >>> Forza >>> >>>> >>>> Metadata centric workloads suck on parity raid anyway. If Btrfs always >>>> does full stripe COW won't matter even if the performance is worse >>>> because no one should use parity raid for this workload anyway. >>>> >>>> >>>> -- >>>> Chris Murphy >>> >>>
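The journalled RMW that Goffredo (and Zygo) argue for boils down to an ordering rule. The sketch below uses hypothetical helper names and a made-up log entry layout -- nothing here is an agreed on-disk format -- but it shows the sequence that closes the write hole for sub-stripe writes.

/* Journalled sub-stripe RMW: make the new data and parity durable in a
 * log area first, so a crash between the in-place writes can be replayed
 * instead of leaving the stripe half-updated. Stubs stand in for real IO.
 */
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

/* One logged update: the new data block plus the recomputed parity. */
struct stripe_update {
	uint64_t stripe_logical;   /* full-stripe start this belongs to */
	uint64_t data_dev_off;     /* destination of the new data block */
	uint64_t parity_dev_off;   /* destination of the new parity     */
	const void *data;
	const void *parity;
	size_t len;
};

static void journal_append(const struct stripe_update *u)
{ printf("journal: log %zu bytes for stripe %llu\n", u->len, (unsigned long long)u->stripe_logical); }
static void journal_flush(void)
{ printf("journal: flush\n"); }
static void write_in_place(uint64_t dev_off, const void *buf, size_t len)
{ (void)buf; printf("write %zu bytes at dev offset %llu\n", len, (unsigned long long)dev_off); }
static void journal_clear(uint64_t stripe_logical)
{ printf("journal: clear entry for stripe %llu\n", (unsigned long long)stripe_logical); }

static void journalled_rmw(const struct stripe_update *u)
{
	journal_append(u);      /* 1. log new data + new parity      */
	journal_flush();        /* 2. make the log entry durable     */
	write_in_place(u->data_dev_off, u->data, u->len);     /* 3. in place */
	write_in_place(u->parity_dev_off, u->parity, u->len);
	journal_clear(u->stripe_logical);   /* 4. retire the entry   */
}

int main(void)
{
	uint8_t data[4096] = {0}, parity[4096] = {0};
	struct stripe_update u = {
		.stripe_logical = 1048576, .data_dev_off = 204800,
		.parity_dev_off = 307200, .data = data, .parity = parity,
		.len = sizeof(data),
	};
	journalled_rmw(&u);
	return 0;
}

On mount after a crash, any entry still present in the journal is simply replayed (steps 3 and 4 again); replaying an already-completed update is idempotent, so the stripe always ends up consistent.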
On 2022/7/26 03:58, Goffredo Baroncelli wrote: > On 25/07/2022 02.00, Zygo Blaxell wrote: >> On Tue, Jul 19, 2022 at 09:19:21AM +0800, Qu Wenruo wrote: > [...] >> >> I'd agree with that. e.g. some btrfs equivalent of ZFS raidZ (put parity >> blocks inline with extents during writes) is not much more complex to >> implement on btrfs than compression; however, the btrfs kernel code >> couldn't read compressed data correctly for 12 years out of its 14-year >> history, and nobody wants to wait another decade or more for raid5 >> to work. >> >> It seems to me the biggest problem with write hole fixes is that all >> the potential fixes have cost tradeoffs, and everybody wants to veto >> the fix that has a cost they don't like. >> >> We could implement multiple fix approaches at the same time, as AFAIK >> most of the proposed solutions are orthogonal to each other. e.g. a >> write-ahead log can safely enable RMW at a higher IO cost, while the >> allocator could place extents to avoid RMW and thereby avoid the logging >> cost as much as possible (paid for by a deferred relocation/garbage >> collection cost), and using both at the same time would combine both >> benefits. Both solutions can be used independently for filesystems at >> extreme ends of the performance/capacity spectrum (if the filesystem is >> never more than 50% full, then logging is all cost with no gain compared >> to allocator avoidance of RMW, while a filesystem that is always near >> full will have to journal writes and also throttle writes on the journal. > > Kudos to Zygo; I have to say that I never encountered before a so clearly > explanation of the complexity around btrfs raid5/6 problems and the related > solutions. > >> >>>> For example a non-striped redundant-n profile as well as a striped >>>> redundant-n profile. >>> >>> Non-striped redundant-n profile is already so complex that I can't >>> figure out a working idea right now. >>> >>> But if there is such way, I'm pretty happy to consider. >>> >>>> >>>>> >>>>> My 2 cents... >>>>> >>>>> Regarding the current raid56 support, in order of preference: >>>>> >>>>> a. Fix the current bugs, without changing format. Zygo has an >>>>> extensive list. >>>> >>>> I agree that relatively simple fixes should be made. But it seems we >>>> will need quite a large rewrite to solve all issues? Is there a >>>> minium viable option here? >>> >>> Nope. Just see my write-intent code, already have prototype (just needs >>> new scrub based recovery code at mount time) working. >>> >>> And based on my write-intent code, I don't think it's that hard to >>> implement a full journal. >> >> FWIW I think we can get a very usable btrfs raid5 with a small format >> change (add a journal for stripe RMW, though we might disagree about >> details of how it should be structured and used)... > > Again, I have to agree with Zygo. 
> Even though I am fascinated by a solution like ZFS (parity blocks inside
> the extent), I think that a journal (and a write intent log) is a more
> pragmatic approach:
> - this kind of solution sits below the btrfs block groups; this would
>   avoid adding further pressure on the metadata
> - being below the other btrfs structures, it may be shaped more easily
>   with less risk of incompatibility
>
> It is true that a ZFS-like solution may be faster in some workloads, but
> I think that these are very few:
> - for high throughput, you likely write the full stripe, which doesn't
>   need the journal/PPL
> - for small block updates, a journal is more efficient than rewriting the
>   full stripe
>
> I hope that the end result of Qu's activities will be a more robust raid5
> btrfs implementation, which will in turn increase the number of users, and
> which will in turn increase the pressure to improve this part of btrfs.
>
> My only suggestion is to evaluate whether we need to develop a write
> intent log and then a journal, instead of developing the journal alone.
> I think that two disk format changes are too much.

That won't be a problem. For write-intent we only need 4K, but during
development I have reserved 1MiB for both write-intent and a future journal.
Thus the format change will only happen once. (A rough sketch of such a
write-intent bitmap follows after this mail.)

Furthermore, that 1MiB can easily be tuned to be larger for the journal.

And for existing RAID56 users, there will be a pretty quick way to convert
to the new write-intent/journal feature.

Thanks,
Qu

>
>
> BR
> G.Baroncelli
>> and fixes to the
>> read-repair and scrub problems. The read-side problems in btrfs raid5
>> were always much more severe than the write hole. As soon as a disk
>> goes offline, the read-repair code is unable to read all the surviving
>> data correctly, and the filesystem has to be kept inactive or data on
>> the disks will be gradually corrupted as bad parity gets mixed with data
>> and written back to the filesystem.
>>
>> A few of the problems will require a deeper redesign, but IMHO they're
>> not important problems. e.g. scrub can't identify which drive is corrupted
>> in all cases, because it has no csum on parity blocks. The current
>> on-disk format needs every data block in the raid5 stripe to be occupied
>> by a file with a csum so scrub can eliminate every other block as the
>> possible source of mismatched parity. While this could be fixed by
>> a future new raid5 profile (and/or csum tree) specifically designed
>> to avoid this, it's not something I'd insist on having before deploying
>> a fleet of btrfs raid5 boxes. Silent corruption failures are so
>> rare on spinning disks that I'd use the feature maybe once a decade.
>> Silent corruption due to a failing or overheating HBA chip will most
>> likely affect multiple disks at once and trash the whole filesystem,
>> so individual drive-level corruption reporting isn't helpful.
>>
>>> Thanks,
>>> Qu
>>>
>>>>
>>>>> b. Mostly fix the write hole, also without changing the format, by
>>>>> only doing COW with full stripe writes. Yes you could somehow get
>>>>> corrupt parity still and not know it until degraded operation produces
>>>>> a bad reconstruction of data - but checksum will still catch that.
>>>>> This kind of "unreplicated corruption" is not quite the same thing as
>>>>> the write hole, because it isn't pernicious like the write hole.
>>>>
>>>> What is the difference to a)? Is the write hole the worst issue?
>>>> Judging from the #btrfs channel discussions there seem to be other quite
>>>> severe issues, for example real data corruption risks in degraded mode.
>>>>
>>>>> c. A new de-clustered parity raid56 implementation that is not
>>>>> backwards compatible.
>>>>
>>>> Yes. We have a good opportunity to work out something much better
>>>> than current implementations. We could have redundant-n profiles
>>>> that also work with tiered storage like ssd/nvme, similar to the
>>>> metadata on ssd idea.
>>>>
>>>> Variable stripe width has been brought up before, but received cool
>>>> responses. Why is that? IMO it could improve random 4k IOs by doing
>>>> the equivalent of RAID1 instead of RMW, while also closing the write
>>>> hole. Perhaps there is a middle ground to be found?
>>>>
>>>>
>>>>>
>>>>> Ergo, I think it's best to not break the format twice. Even if a new
>>>>> raid implementation is years off.
>>>>
>>>> I very much agree here. Btrfs already suffers in public opinion from the
>>>> lack of a stable and safe-for-data RAID56, and requiring several
>>>> non-compatible changes isn't going to help.
>>>>
>>>> I also think it's important that the 'temporary' changes actually
>>>> lead to a stable filesystem. Because what is the point otherwise?
>>>>
>>>> Thanks
>>>> Forza
>>>>
>>>>>
>>>>> Metadata centric workloads suck on parity raid anyway. If Btrfs always
>>>>> does full stripe COW, it won't matter even if the performance is worse,
>>>>> because no one should use parity raid for this workload anyway.
>>>>>
>>>>>
>>>>> --
>>>>> Chris Murphy
>>>>
>>>>
>
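To make the write-intent discussion above a little more concrete, here is a minimal user-space C sketch of the kind of bitmap that could fit in a 4KiB reserved area: a bit is set and flushed before a partial-stripe RMW starts, and cleared once data and parity are fully on disk, so after a crash only the marked ranges need to be scrubbed/resynced. The structure, the 64MiB-per-bit granularity, the modulo wrapping and all names (struct write_intent, wi_set() and so on) are assumptions made up for the illustration; they do not describe the on-disk format of the actual patches.

/* Sketch of a write-intent bitmap; layout and names are assumptions. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define WI_BITMAP_BYTES 4096U                    /* fits the 4K mentioned above */
#define WI_BITS         (WI_BITMAP_BYTES * 8U)   /* 32768 bits                  */
#define WI_CHUNK_BYTES  (64ULL * 1024 * 1024)    /* one bit per 64MiB, invented */

struct write_intent {
        uint8_t bitmap[WI_BITMAP_BYTES];
};

/*
 * Real code would derive the per-bit granularity from the device size so
 * that bits never alias; the modulo below only keeps this toy in bounds.
 */
static uint64_t wi_bit(uint64_t physical)
{
        return (physical / WI_CHUNK_BYTES) % WI_BITS;
}

/* Set before the RMW starts; must be flushed to disk before the RMW. */
static void wi_set(struct write_intent *wi, uint64_t physical)
{
        uint64_t bit = wi_bit(physical);

        wi->bitmap[bit / 8] |= (uint8_t)(1U << (bit % 8));
}

/* Clear once data and parity of the stripe are durably on disk. */
static void wi_clear(struct write_intent *wi, uint64_t physical)
{
        uint64_t bit = wi_bit(physical);

        wi->bitmap[bit / 8] &= (uint8_t)~(1U << (bit % 8));
}

/* After a crash: only ranges whose bit is set need a resync/scrub. */
static int wi_test(const struct write_intent *wi, uint64_t physical)
{
        uint64_t bit = wi_bit(physical);

        return (wi->bitmap[bit / 8] >> (bit % 8)) & 1;
}

int main(void)
{
        struct write_intent wi;
        uint64_t addr = 3ULL * 1024 * 1024 * 1024;   /* some stripe at 3GiB */

        memset(&wi, 0, sizeof(wi));
        wi_set(&wi, addr);
        printf("dirty after set:   %d\n", wi_test(&wi, addr));  /* prints 1 */
        wi_clear(&wi, addr);
        printf("dirty after clear: %d\n", wi_test(&wi, addr));  /* prints 0 */
        return 0;
}

A full journal would go one step further and also log the new data and parity so a torn RMW can be replayed rather than merely resynced, which is roughly what the larger 1MiB reservation mentioned above would leave room for.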