Message ID: 20180809180450.5091-1-naota@elisp.net
Series: btrfs zoned block device support
On 08/09/2018 08:04 PM, Naohiro Aota wrote:
> This series adds zoned block device support to btrfs.
>
> A zoned block device consists of a number of zones. Zones are either
> conventional, accepting random writes, or sequential, requiring that
> writes be issued in LBA order from each zone's write pointer position.
> This patch series ensures that the sequential write constraint of
> sequential zones is respected while fundamentally not changing btrfs
> block and I/O management for blocks stored in conventional zones.
>
> To achieve this, the default dev extent size of btrfs is changed on
> zoned block devices so that dev extents are always aligned to a zone.
> Allocation of blocks within a block group is changed so that the
> allocation is always sequential from the beginning of the block group.
> To do so, an allocation pointer is added to block groups and used as
> the allocation hint. The allocation changes also ensure that blocks
> freed below the allocation pointer are ignored, resulting in
> sequential block allocation regardless of the block group usage.
>
> While the introduction of the allocation pointer ensures that blocks
> will be allocated sequentially, I/Os to write out newly allocated
> blocks may be issued out of order, causing errors when writing to
> sequential zones. This problem is solved by introducing a
> submit_buffer() function and changes to the internal I/O scheduler to
> ensure in-order issuing of write I/Os for each chunk, corresponding
> to the block allocation order in the chunk.
>
> The zones of a chunk are reset to allow reuse of the zones only when
> the block group is being freed, that is, when all the extents of the
> block group are unused.
>
> For btrfs volumes composed of multiple zoned disks, restrictions are
> added to ensure that all disks have the same zone size. This matches
> the existing constraint that all dev extents in a chunk must have the
> same size.
>
> Zoned block devices are required to test the patchset. Even if you
> don't have zoned devices, you can use tcmu-runner [1] to emulate
> zoned block devices. It can export emulated zoned block devices via
> iSCSI. Please see the README.md of tcmu-runner [2] for how to
> generate a zoned block device with tcmu-runner.
>
> [1] https://github.com/open-iscsi/tcmu-runner
> [2] https://github.com/open-iscsi/tcmu-runner/blob/master/README.md
>
> Patch 1 introduces the HMZONED incompatible feature flag to indicate
> that the btrfs volume was formatted for use on zoned block devices.
>
> Patches 2 and 3 implement functions to gather information on the
> zones of the device (zone type and write pointer position).
>
> Patch 4 restricts the possible locations of super blocks to
> conventional zones to preserve the existing update in-place mechanism
> for the super blocks.
>
> Patches 5 to 7 disable features which are not compatible with the
> sequential write constraints of zoned block devices. This includes
> fallocate and direct I/O support. Device replace is also disabled for
> now.
>
> Patches 8 and 9 tweak the extent buffer allocation for HMZONED mode
> to implement sequential block allocation in block groups and chunks.
>
> Patches 10 to 12 implement the new submit buffer I/O path to ensure
> sequential write I/O delivery to the device zones.
>
> Patches 13 to 16 modify several parts of btrfs to handle free blocks
> without breaking the sequential block allocation and sequential write
> order as well as zone reset for unused chunks.
>
> Finally, patch 17 adds the HMZONED feature to the list of supported
> features.

Thanks for doing all the work.
However, the patches don't apply cleanly to the current master branch.
Can you please rebase them?

Thanks.

Cheers,

Hannes
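For readers new to the allocation-pointer scheme in the cover letter
above, here is a minimal C sketch of the idea. The names
(example_block_group, example_sequential_alloc) are hypothetical
illustrations, not the identifiers used by the actual patches:

#include <linux/types.h>
#include <linux/errno.h>

/*
 * Sketch only: a block group with an allocation pointer, as described
 * in the cover letter. Free space below the pointer is deliberately
 * ignored so that writes into a sequential zone always happen in LBA
 * order.
 */
struct example_block_group {
	u64 start;	/* first byte of the block group */
	u64 length;	/* total size of the block group */
	u64 alloc_ptr;	/* offset of the next sequential allocation */
};

static int example_sequential_alloc(struct example_block_group *bg,
				    u64 size, u64 *ret_start)
{
	if (bg->alloc_ptr + size > bg->length)
		return -ENOSPC;	/* group full; caller tries another one */

	*ret_start = bg->start + bg->alloc_ptr;
	bg->alloc_ptr += size;	/* the pointer only ever moves forward */
	return 0;
}

Freed blocks below alloc_ptr simply stay unused until the whole block
group is emptied and its zones are reset, which is exactly the
trade-off the cover letter describes.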
On 08/09/2018 08:04 PM, Naohiro Aota wrote:
> This series adds zoned block device support to btrfs.
>
> [...]
>
> Finally, patch 17 adds the HMZONED feature to the list of supported
> features.

And unfortunately this series fails to boot for me:

BTRFS error (device nvme0n1p5): zoned devices mixed with regular devices
BTRFS error (device nvme0n1p5): failed to init hmzoned mode: -22
BTRFS error (device nvme0n1p5): open_ctree failed

Needless to say, /dev/nvme0n1p5 is _not_ a zoned device. Nor does the
zoned device have a btrfs superblock ATM.

Cheers,

Hannes
On 8/10/18 2:04 AM, Naohiro Aota wrote:
> This series adds zoned block device support to btrfs.
>
> A zoned block device consists of a number of zones. Zones are either
> conventional, accepting random writes, or sequential, requiring that
> writes be issued in LBA order from each zone's write pointer position.

Not familiar with zoned block devices, especially for the sequential
case.

Is that sequential case tape-like?

> This patch series ensures that the sequential write constraint of
> sequential zones is respected while fundamentally not changing btrfs
> block and I/O management for blocks stored in conventional zones.
>
> To achieve this, the default dev extent size of btrfs is changed on
> zoned block devices so that dev extents are always aligned to a zone.
> Allocation of blocks within a block group is changed so that the
> allocation is always sequential from the beginning of the block group.
> To do so, an allocation pointer is added to block groups and used as
> the allocation hint. The allocation changes also ensure that blocks
> freed below the allocation pointer are ignored, resulting in
> sequential block allocation regardless of the block group usage.

This looks like it would cause a lot of holes for metadata block
groups. It would be better to avoid metadata block allocation in such
sequential zones. (And that would need the infrastructure to make the
extent allocator priority-aware.)

> [...]
>
> Naohiro Aota (17):
>   [...]
>   btrfs: align extent allocation to zone boundary

According to the patch name, I thought it was about extent allocation,
but in fact it's about dev extent allocation. Renaming the patch would
make more sense.

>   btrfs: do sequential allocation on HMZONED drives

And this is the patch modifying the extent allocator.

Despite that, the support for zoned storage looks pretty interesting
and has something in common with the planned priority-aware extent
allocator.

Thanks,
Qu

>   [...]
On 9.08.2018 21:04, Naohiro Aota wrote:
> This series adds zoned block device support to btrfs.
>
> [...]
>
> Naohiro Aota (17):
>   btrfs: introduce HMZONED feature flag
>   btrfs: Get zone information of zoned block devices
>   btrfs: Check and enable HMZONED mode
>   btrfs: limit super block locations in HMZONED mode
>   btrfs: disable fallocate in HMZONED mode
>   btrfs: disable direct IO in HMZONED mode
>   btrfs: disable device replace in HMZONED mode
>   btrfs: align extent allocation to zone boundary
>   btrfs: do sequential allocation on HMZONED drives
>   btrfs: split btrfs_map_bio()
>   btrfs: introduce submit buffer
>   btrfs: expire submit buffer on timeout
>   btrfs: avoid sync IO prioritization on checksum in HMZONED mode
>   btrfs: redirty released extent buffers in sequential BGs
>   btrfs: reset zones of unused block groups
>   btrfs: wait existing extents before truncating
>   btrfs: enable to mount HMZONED incompat flag
>
>  fs/btrfs/async-thread.c     |   1 +
>  fs/btrfs/async-thread.h     |   1 +
>  fs/btrfs/ctree.h            |  36 ++-
>  fs/btrfs/dev-replace.c      |  10 +
>  fs/btrfs/disk-io.c          |  48 +++-
>  fs/btrfs/extent-tree.c      | 281 +++++++++++++++++-
>  fs/btrfs/extent_io.c        |   1 +
>  fs/btrfs/extent_io.h        |   1 +
>  fs/btrfs/file.c             |   4 +
>  fs/btrfs/free-space-cache.c |  36 +++
>  fs/btrfs/free-space-cache.h |  10 +
>  fs/btrfs/inode.c            |  14 +
>  fs/btrfs/super.c            |  32 ++-
>  fs/btrfs/sysfs.c            |   2 +
>  fs/btrfs/transaction.c      |  32 +++
>  fs/btrfs/transaction.h      |   3 +
>  fs/btrfs/volumes.c          | 551 ++++++++++++++++++++++++++++++++++--
>  fs/btrfs/volumes.h          |  37 +++
>  include/uapi/linux/btrfs.h  |   1 +
>  19 files changed, 1061 insertions(+), 40 deletions(-)

There are multiple places where you do naked shifts by
ilog2(sectorsize). There is a perfectly well-named define,
SECTOR_SHIFT, which is a lot more informative for someone who doesn't
necessarily have experience with the linux storage/fs layers. Please
fix such occurrences of magic-value shifting.
On 10.08.2018 10:53, Nikolay Borisov wrote:
> On 9.08.2018 21:04, Naohiro Aota wrote:
>> This series adds zoned block device support to btrfs.
>>
>> [...]
>
> There are multiple places where you do naked shifts by
> ilog2(sectorsize). There is a perfectly well-named define,
> SECTOR_SHIFT, which is a lot more informative for someone who doesn't
> necessarily have experience with the linux storage/fs layers. Please
> fix such occurrences of magic-value shifting.

And Hannes just reminded me that this landed in commit 233bde21aa43
("block: Move SECTOR_SIZE and SECTOR_SHIFT definitions into
<linux/blkdev.h>") this March, so it might be fairly recent depending
on the tree you've based your work on.
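As a concrete illustration of the review comment above, the suggested
cleanup looks like the following sketch. The helper is hypothetical;
only SECTOR_SHIFT and ilog2() are real kernel definitions, and the
rewrite assumes the shift in question really converts bytes to
512-byte sectors:

#include <linux/blkdev.h>	/* SECTOR_SHIFT, since commit 233bde21aa43 */
#include <linux/log2.h>		/* ilog2() */
#include <linux/types.h>

/* Hypothetical helper, for illustration only. */
static inline u64 example_bytes_to_sectors(u64 len)
{
	/* Before: a naked shift by a magic value. */
	/* return len >> ilog2(512); */

	/* After: the named constant documents what the shift means. */
	return len >> SECTOR_SHIFT;
}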
On 08/10/2018 09:28 AM, Qu Wenruo wrote:
> On 8/10/18 2:04 AM, Naohiro Aota wrote:
>> This series adds zoned block device support to btrfs.
>>
>> [...]
>
> And this is the patch modifying the extent allocator.
>
> Despite that, the support for zoned storage looks pretty interesting
> and has something in common with the planned priority-aware extent
> allocator.

Priority-aware allocator? Is someone actually working on that, or is it
planned like everything is 'planned' (i.e. a nice idea that might
happen or might as well not happen ever, SIYH)?
On 8/10/18 9:32 PM, Hans van Kranenburg wrote:
> On 08/10/2018 09:28 AM, Qu Wenruo wrote:
>> [...]
>>
>> Despite that, the support for zoned storage looks pretty interesting
>> and has something in common with the planned priority-aware extent
>> allocator.
>
> Priority-aware allocator? Is someone actually working on that, or is
> it planned like everything is 'planned' (i.e. a nice idea that might
> happen or might as well not happen ever, SIYH)?

I'm working on this, although it will take some time.

It was originally designed to solve the problem where some empty block
groups won't be freed due to pinned bytes.

Thanks,
Qu
On Fri, Aug 10, 2018 at 09:04:59AM +0200, Hannes Reinecke wrote:
> On 08/09/2018 08:04 PM, Naohiro Aota wrote:
> > This series adds zoned block device support to btrfs.
> >
> > [...]
>
> Thanks for doing all the work.
> However, the patches don't apply cleanly to the current master branch.
> Can you please rebase them?

I'm currently basing on the
https://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git
for-next branch, since my previous bug-fix patch 266e010932ce ("btrfs:
revert fs_devices state on error of btrfs_init_new_device") is
necessary to avoid a use-after-free bug in the error handling path of
btrfs_init_new_device() in patch 2.

I'm sorry for not mentioning it. I'll rebase on the master branch when
that patch reaches master.

Regards,
Naohiro
On Fri, Aug 10, 2018 at 03:04:33AM +0900, Naohiro Aota wrote:
> This series adds zoned block device support to btrfs.
Yay, thanks!
As this is an RFC, I'll give you some. The code looks OK for what it
claims to do; I'll skip style and unimportant implementation details
for now, as there are bigger questions.
The zoned devices bring some constraints, so not all filesystem
features can be expected to work; this rules out, for example, any
form of in-place updates like NODATACOW.
Then there's the list of 'how will zoned devices work with feature X?'
You disable fallocate and DIO. I haven't looked closer at the fallocate
case, but DIO could work in the sense that open() will open the file but
any write will fall back to buffered writes. This is implemented, so it
would need to be wired together.
Mixed device types are not allowed, and I tend to agree with that,
though this could work in principle. Just that the chunk allocator
would have to be aware of the device types and tweaked to allocate from
the same group. The btrfs code is not ready for that in terms of the
allocator capabilities and configuration options.
Device replace is disabled, but the changelog suggests there's a way to
make it work, so it's a matter of implementation. And this should be
implemented at the time of merge.
RAID5/6 + zoned support is highly desired and lack of it could be
considered a NAK for the whole series. The drive sizes are expected to
be several terabytes, so it sounds too risky to lack the redundancy
options (RAID1 is not sufficient here).
The changelog does not explain why this does not or cannot work, so I
cannot reason about that or possibly suggest workarounds or solutions.
But I think it should work in principle.
As this is the first post and an RFC, I don't expect that everything is
implemented, but at least the known missing points should be documented.
You've implemented lots of the low-level zoned support and extent
allocation, so even if raid56 might be difficult, it should be the
smaller part.
On 08/13/2018 08:42 PM, David Sterba wrote:
> On Fri, Aug 10, 2018 at 03:04:33AM +0900, Naohiro Aota wrote:
>> This series adds zoned block device support to btrfs.
>
> Yay, thanks!
>
> [...]
>
> Device replace is disabled, but the changelog suggests there's a way
> to make it work, so it's a matter of implementation. And this should
> be implemented at the time of merge.

How would device replace work in general?
While I do understand that device replace is possible with RAID
thingies, I somewhat fail to see how one could do a device replacement
without RAID functionality.
Is it even possible?
If so, how would it be different from a simple umount?

> RAID5/6 + zoned support is highly desired and lack of it could be
> considered a NAK for the whole series. The drive sizes are expected
> to be several terabytes, so it sounds too risky to lack the
> redundancy options (RAID1 is not sufficient here).

That really depends on the allocator.
If we can make the RAID code work with zone-sized stripes, it should
be pretty trivial. I can have a look at that; RAID support was on my
agenda anyway (albeit for MD, not for btrfs).

> The changelog does not explain why this does not or cannot work, so I
> cannot reason about that or possibly suggest workarounds or
> solutions. But I think it should work in principle.

As mentioned, it really should work for zone-sized stripes. I'm not
sure we can make it work with stripes smaller than the zone size.

> As this is the first post and an RFC, I don't expect that everything
> is implemented, but at least the known missing points should be
> documented. You've implemented lots of the low-level zoned support
> and extent allocation, so even if raid56 might be difficult, it
> should be the smaller part.

FYI, I've run a simple stress test on a zoned device (git clone linus
&& make) and haven't found any issues; compilation ran without a
problem, and at quite decent speed.
Good job!

Cheers,

Hannes
On 2018-08-13 15:20, Hannes Reinecke wrote:
> On 08/13/2018 08:42 PM, David Sterba wrote:
>> [...]
>>
>> Device replace is disabled, but the changelog suggests there's a way
>> to make it work, so it's a matter of implementation. And this should
>> be implemented at the time of merge.
>
> How would device replace work in general?
> While I do understand that device replace is possible with RAID
> thingies, I somewhat fail to see how one could do a device
> replacement without RAID functionality.
> Is it even possible?
> If so, how would it be different from a simple umount?

Device replace is implemented in largely the same manner as most other
live data migration tools (for example, LVM2's pvmove command).

In short, when you issue a replace command for a given device, all
writes that would go to that device are instead sent to the new
device. While this is happening, old data is copied over from the old
device to the new one. Once all the data is copied, the old device is
released (and its btrfs signature wiped), and the new device has its
device ID updated to that of the old device.

This is possible largely because of the COW infrastructure, but it's
implemented in a way that doesn't entirely depend on it (otherwise it
wouldn't work for NOCOW files).

Handling this on zoned devices is not likely to be easy though; you
would functionally have to freeze I/O that would hit the device being
replaced so that you don't accidentally write to a sequential zone out
of order.

> [...]
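A conceptual sketch of the redirect-then-copy scheme described above.
All names are invented for illustration; the real btrfs dev-replace
code is considerably more involved:

#include <linux/types.h>

/* Hypothetical replace bookkeeping, mirroring the pvmove-like flow. */
struct example_replace_state {
	u64 src_devid;		/* device being replaced */
	u64 dst_devid;		/* replacement target */
	u64 copied_up_to;	/* background copy progress, in bytes */
	bool in_progress;
};

/*
 * New writes aimed at the source device go to the target instead,
 * while a background job copies existing data up to copied_up_to.
 */
static u64 example_map_write_devid(const struct example_replace_state *rs,
				   u64 devid)
{
	if (rs->in_progress && devid == rs->src_devid)
		return rs->dst_devid;
	return devid;
}

On a zoned target, the redirected writes are exactly what would land
mid-zone and break sequential ordering, which is the difficulty the
discussion turns to next.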
On 08/13/2018 09:29 PM, Austin S. Hemmelgarn wrote:
> On 2018-08-13 15:20, Hannes Reinecke wrote:
>> [...]
>> How would device replace work in general?
>> [...]
>
> [...]
>
> Handling this on zoned devices is not likely to be easy though; you
> would functionally have to freeze I/O that would hit the device being
> replaced so that you don't accidentally write to a sequential zone
> out of order.

Ah. Oh. Hmm.

It would be possible in principle if we freeze accesses to any
partially filled zones on the original device. Then all new writes
will be going into new/empty zones on the new disk, and we can copy
over the old data with no issue at all.
We end up with some partially filled zones on the new disk, but they
really should be cleaned up eventually, either by the allocator
filling up the partially filled zones or once garbage collection
clears out stale zones.

However, I fear the required changes to the btrfs allocator are beyond
my btrfs knowledge :-(

Cheers,

Hannes
On 2018-08-14 03:41, Hannes Reinecke wrote:
> On 08/13/2018 09:29 PM, Austin S. Hemmelgarn wrote:
>> [...]
>
> It would be possible in principle if we freeze accesses to any
> partially filled zones on the original device. Then all new writes
> will be going into new/empty zones on the new disk, and we can copy
> over the old data with no issue at all.
>
> [...]

The easy short-term solution is to just disallow the replace command
(with the intent of getting it working in the future), but ensure that
the older-style add/remove method works. That uses the balance code
internally, so it should honor any restrictions on block placement for
the new device, and therefore should be pretty easy to get working.
On Fri, Aug 10, 2018 at 03:28:21PM +0800, Qu Wenruo wrote:
> On 8/10/18 2:04 AM, Naohiro Aota wrote:
> > This series adds zoned block device support to btrfs.
> >
> > A zoned block device consists of a number of zones. Zones are
> > either conventional, accepting random writes, or sequential,
> > requiring that writes be issued in LBA order from each zone's write
> > pointer position.
>
> Not familiar with zoned block devices, especially for the sequential
> case.
>
> Is that sequential case tape-like?

It's somewhat similar to, but not the same as, tape drives. On tape
drives, you still *can* write in random access patterns, though it's
much slower. In sequential write required zones, writing is always
enforced to be sequential within a zone. Violating the sequential
write rule results in an I/O error.

One user of sequential write required zones is the Host-Managed
"Shingled Magnetic Recording" (SMR) HDD [1]. These drives increase the
volume capacity by overlapping the tracks. As a result, writing to a
track overwrites adjacent tracks. This physical behavior forces the
sequential write pattern.

[1] https://en.wikipedia.org/wiki/Shingled_magnetic_recording

> > [...] Allocation of blocks within a block group is changed so that
> > the allocation is always sequential from the beginning of the
> > block group. [...] The allocation changes also ensure that blocks
> > freed below the allocation pointer are ignored, resulting in
> > sequential block allocation regardless of the block group usage.
>
> This looks like it would cause a lot of holes for metadata block
> groups. It would be better to avoid metadata block allocation in such
> sequential zones. (And that would need the infrastructure to make the
> extent allocator priority-aware.)

Yes, it would introduce holes in metadata block groups. I agree it is
desirable to allocate metadata blocks from conventional
(non-sequential) zones. However, it's sometimes impossible to allocate
metadata blocks from conventional zones, since the number of
conventional zones is generally much smaller than that of sequential
zones in zoned block devices like SMR HDDs (to achieve higher volume
capacity).

While this patch series ensures metadata/data can be allocated in any
type of zone and everything works in any zone, we will be able to
improve metadata allocation by making the extent allocator
priority/zone-type aware in the future.

> > btrfs: align extent allocation to zone boundary
>
> According to the patch name, I thought it was about extent
> allocation, but in fact it's about dev extent allocation. Renaming
> the patch would make more sense.
>
> > btrfs: do sequential allocation on HMZONED drives
>
> And this is the patch modifying the extent allocator.

Thanks. I will fix the names of the patches in the next version.

> Despite that, the support for zoned storage looks pretty interesting
> and has something in common with the planned priority-aware extent
> allocator.
>
> Thanks,
> Qu

Regards,
Naohiro
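To make the zone model above concrete, here is a small userspace C
sketch that lists a device's zones via the BLKREPORTZONE ioctl
(available since Linux 4.10). The batch size of 128 zones is an
arbitrary assumption, and error handling is minimal:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/blkzoned.h>

int main(int argc, char **argv)
{
	struct blk_zone_report *rep;
	unsigned int i, nr = 128;	/* arbitrary report batch size */
	int fd;

	if (argc != 2)
		return 1;
	fd = open(argv[1], O_RDONLY);
	if (fd < 0)
		return 1;

	rep = calloc(1, sizeof(*rep) + nr * sizeof(struct blk_zone));
	rep->sector = 0;		/* report from the device start */
	rep->nr_zones = nr;
	if (ioctl(fd, BLKREPORTZONE, rep) < 0)
		return 1;

	/* The kernel updates nr_zones to the number actually reported. */
	for (i = 0; i < rep->nr_zones; i++) {
		struct blk_zone *z = &rep->zones[i];

		printf("zone %3u: start %10llu len %8llu wp %10llu %s\n",
		       i, (unsigned long long)z->start,
		       (unsigned long long)z->len,
		       (unsigned long long)z->wp,
		       z->type == BLK_ZONE_TYPE_CONVENTIONAL ?
				"conventional" : "sequential");
	}
	free(rep);
	close(fd);
	return 0;
}

Running it against a tcmu-runner-emulated device, as suggested in the
cover letter, shows the conventional/sequential split and the per-zone
write pointers that the series builds on.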
Thank you for your review!

On Mon, Aug 13, 2018 at 08:42:52PM +0200, David Sterba wrote:
> On Fri, Aug 10, 2018 at 03:04:33AM +0900, Naohiro Aota wrote:
> > This series adds zoned block device support to btrfs.
>
> Yay, thanks!
>
> [...]
>
> Then there's the list of 'how will zoned devices work with feature X?'

Here is the current HMZONED status list based on
https://btrfs.wiki.kernel.org/index.php/Status

Performance
  Trim            | OK
  Autodefrag      | OK
  Defrag          | OK
  fallocate       | Disabled. Cannot reserve a region in sequential zones.
  direct IO       | Disabled. Falling back to buffered IO.
  Compression     | OK

Reliability
  Auto-repair     | Not working. Need to rewrite the corrupted extent.
  Scrub           | Not working. Need to rewrite the corrupted extent.
  Scrub + RAID56  | Not working (RAID56).
  nodatacow       | Should be disabled (noticed it's not disabled now).
  Device replace  | Disabled for now (need to handle write pointer
                    issues; WIP patch).
  Degraded mount  | OK

Block group profile
  Single          | OK
  DUP             | OK
  RAID0           | OK
  RAID1           | OK
  RAID10          | OK
  RAID56          | Disabled for now. Need to avoid partial parity writes.
  Mixed BG        | OK

Administration    | OK

Misc
  Free space tree | Disabled. Not necessary for the sequential allocator.
  no-holes        | OK
  skinny-metadata | OK
  extended-refs   | OK

> You disable fallocate and DIO. I haven't looked closer at the
> fallocate case, but DIO could work in the sense that open() will open
> the file but any write will fall back to buffered writes. This is
> implemented, so it would need to be wired together.

Actually, it's working like that. When check_direct_IO() returns
-EINVAL, btrfs_direct_IO() still returns 0. As a result, the callers
fall back to buffered IO. I will reword the commit subject and log to
reflect the actual behavior. Also, I will relax the condition to
disable only direct write IOs.

> Mixed device types are not allowed, and I tend to agree with that,
> though this could work in principle. Just that the chunk allocator
> would have to be aware of the device types and tweaked to allocate
> from the same group. The btrfs code is not ready for that in terms of
> the allocator capabilities and configuration options.

Yes, it will work if the allocator is improved to notice the device
type, zone type and zone size.

> Device replace is disabled, but the changelog suggests there's a way
> to make it work, so it's a matter of implementation. And this should
> be implemented at the time of merge.

I have a WIP patch to support device replace. But it fails after
device replacing due to a write pointer mismatch. I'm debugging the
code, so the following version may enable the feature.

> RAID5/6 + zoned support is highly desired and lack of it could be
> considered a NAK for the whole series. [...]
>
> You've implemented lots of the low-level zoned support and extent
> allocation, so even if raid56 might be difficult, it should be the
> smaller part.

I was leaving RAID56 for the future, since I'm not yet used to the
raid56 code, and its write path (raid56_parity_write) seems to be
separate from the others' (submit_stripe_bio).

I quickly checked whether RAID5 works with the current HMZONED
patches. But even with a simple sequential workload using dd, it
caused I/O failures because partial parity writes introduced overwrite
I/Os, which violate the sequential write rule.

At a quick glance at the raid56 code, I'm currently not sure how we
can avoid partial parity writes while dispatching the necessary I/Os
on transaction commit.

Regards,
Naohiro
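Regarding the direct IO fallback discussed above, here is a minimal
sketch of the pattern. The example_* names are illustrations; only the
"return 0 so the caller falls back to buffered IO" behavior is taken
from the discussion itself:

#include <linux/fs.h>
#include <linux/uio.h>
#include <linux/errno.h>

/* Hypothetical helpers standing in for the real checks and IO path. */
static int example_check_direct_IO(struct iov_iter *iter);
static ssize_t example_do_direct_IO(struct kiocb *iocb,
				    struct iov_iter *iter);

/* Sketch only, not the actual btrfs implementation. */
static ssize_t example_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
{
	/* On HMZONED volumes, direct writes cannot honor the
	 * sequential write constraint, so reject them here. */
	if (example_check_direct_IO(iter) == -EINVAL)
		return 0;	/* caller falls back to buffered IO */

	return example_do_direct_IO(iocb, iter);
}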