[RFC,00/17] btrfs zoned block device support

Message ID 20180809180450.5091-1-naota@elisp.net

Naohiro Aota Aug. 9, 2018, 6:04 p.m. UTC
This series adds zoned block device support to btrfs.

A zoned block device consists of a number of zones. Zones are either
conventional, accepting random writes, or sequential, requiring that
writes be issued in LBA order starting from each zone's write pointer
position. This patch series ensures that the sequential write constraint
of sequential zones is respected while fundamentally not changing btrfs
block and I/O management for blocks stored in conventional zones.
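
For reference, the zone layout and write pointers can be inspected from
user space with the BLKREPORTZONE ioctl. Below is a minimal sketch (not
part of the series; the device path is just an example):

	/* report_zones.c: print type and write pointer of the first zones */
	#include <stdio.h>
	#include <stdlib.h>
	#include <fcntl.h>
	#include <unistd.h>
	#include <sys/ioctl.h>
	#include <linux/blkzoned.h>

	int main(void)
	{
		unsigned int i, nr = 16;
		struct blk_zone_report *rep;
		int fd = open("/dev/sdb", O_RDONLY);	/* example zoned device */

		if (fd < 0)
			return 1;
		rep = calloc(1, sizeof(*rep) + nr * sizeof(struct blk_zone));
		if (!rep)
			return 1;
		rep->sector = 0;	/* start reporting from the first zone */
		rep->nr_zones = nr;
		if (ioctl(fd, BLKREPORTZONE, rep) == 0) {
			for (i = 0; i < rep->nr_zones; i++) {
				struct blk_zone *z = &rep->zones[i];

				/* start/wp are in units of 512-byte sectors */
				printf("zone %u: start=%llu wp=%llu %s\n", i,
				       (unsigned long long)z->start,
				       (unsigned long long)z->wp,
				       z->type == BLK_ZONE_TYPE_CONVENTIONAL ?
				       "conventional" : "sequential");
			}
		}
		free(rep);
		close(fd);
		return 0;
	}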

To achieve this, the default dev extent size of btrfs is changed on zoned
block devices so that dev extents are always aligned to a zone. Allocation
of blocks within a block group is changed so that the allocation is always
sequential from the beginning of the block group. To do so, an allocation
pointer is added to block groups and used as the allocation hint.  The
allocation changes also ensure that blocks freed below the allocation
pointer are ignored, resulting in sequential block allocation regardless of
the block group usage.
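
For illustration, the allocation pointer idea in its simplest form looks
like the sketch below (the structure and function names are made up for
the example; the actual patches extend the existing btrfs block group and
free space code):

	/*
	 * Sketch: sequential allocation within one block group.
	 * alloc_offset only ever grows; space freed below it is
	 * intentionally not reused.
	 */
	struct seq_block_group {
		u64 start;		/* logical start of the block group */
		u64 length;		/* total size of the block group */
		u64 alloc_offset;	/* next allocation position */
	};

	static int seq_alloc(struct seq_block_group *bg, u64 num_bytes,
			     u64 *ret_start)
	{
		if (bg->alloc_offset + num_bytes > bg->length)
			return -ENOSPC;	/* no room left in this group */
		*ret_start = bg->start + bg->alloc_offset;
		bg->alloc_offset += num_bytes;
		return 0;
	}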

While the introduction of the allocation pointer ensures that blocks will be
allocated sequentially, I/Os to write out newly allocated blocks may be
issued out of order, causing errors when writing to sequential zones. This
problem is solved by introducing a submit_buffer() function and changes to
the internal I/O scheduler to ensure in-order issuing of write I/Os for
each chunk, matching the block allocation order in the chunk.
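
Conceptually, the submit buffer works like the following sketch (the names
here are illustrative only; the real implementation hooks into the btrfs
I/O scheduler):

	/*
	 * Sketch: per-chunk submit buffer. Write bios are queued with
	 * their target offset and only issued when they are next in
	 * sequence, so a zone never sees an out-of-order write.
	 */
	struct pending_write {
		struct list_head list;	/* kept sorted by offset */
		u64 offset;		/* where this write must land */
		struct bio *bio;
	};

	struct submit_buffer {
		struct list_head writes;
		u64 submit_ptr;		/* next offset to be issued */
	};

	static void try_submit(struct submit_buffer *buf)
	{
		struct pending_write *pw, *tmp;

		list_for_each_entry_safe(pw, tmp, &buf->writes, list) {
			if (pw->offset != buf->submit_ptr)
				break;	/* hole in the sequence; wait */
			buf->submit_ptr += pw->bio->bi_iter.bi_size;
			list_del(&pw->list);
			submit_bio(pw->bio);
			kfree(pw);
		}
	}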

The zones of a chunk are reset, allowing the zones to be reused, only when
the block group is being freed, that is, when all the extents of the block
group are unused.

For btrfs volumes composed of multiple zoned disks, restrictions are added
to ensure that all disks have the same zone size. This matches the existing
constraint that all dev extents in a chunk must have the same size.
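
The check for this restriction can be as simple as the following sketch
(illustrative only; it uses the generic block layer helpers):

	/* Sketch: all zoned devices in the fs must share one zone size. */
	static int check_zone_sizes(struct btrfs_fs_devices *fs_devices)
	{
		struct btrfs_device *device;
		u64 zone_size = 0, zs;

		list_for_each_entry(device, &fs_devices->devices, dev_list) {
			if (!bdev_is_zoned(device->bdev))
				continue;
			zs = bdev_zone_sectors(device->bdev) << SECTOR_SHIFT;
			if (!zone_size)
				zone_size = zs;
			else if (zs != zone_size)
				return -EINVAL;	/* mixed zone sizes */
		}
		return 0;
	}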

Zoned block devices are required to test the patchset. Even if you don't
have zoned devices, you can use tcmu-runner [1] to emulate zoned block
devices. It can export emulated zoned block devices via iSCSI. Please see
the README.md of tcmu-runner [2] for instructions on generating a zoned
block device with tcmu-runner.

[1] https://github.com/open-iscsi/tcmu-runner
[2] https://github.com/open-iscsi/tcmu-runner/blob/master/README.md

Patch 1 introduces the HMZONED incompatible feature flag to indicate that
the btrfs volume was formatted for use on zoned block devices.

Patches 2 and 3 implement functions to gather information on the zones of
the device (zone type and write pointer position).

Patch 4 restricts the possible locations of super blocks to conventional
zones to preserve the existing in-place update mechanism for the super
blocks.

Patches 5 to 7 disable features which are not compatible with the sequential
write constraints of zoned block devices. This includes fallocate and
direct I/O support. Device replace is also disabled for now.

Patches 8 and 9 tweak the extent buffer allocation for HMZONED mode to
implement sequential block allocation in block groups and chunks.

Patches 10 to 12 implement the new submit buffer I/O path to ensure sequential
write I/O delivery to the device zones.

Patches 13 to 16 modify several parts of btrfs to handle free blocks
without breaking the sequential block allocation and sequential write order
as well as zone reset for unused chunks.

Finally, patch 17 adds the HMZONED feature to the list of supported
features.

Naohiro Aota (17):
  btrfs: introduce HMZONED feature flag
  btrfs: Get zone information of zoned block devices
  btrfs: Check and enable HMZONED mode
  btrfs: limit super block locations in HMZONED mode
  btrfs: disable fallocate in HMZONED mode
  btrfs: disable direct IO in HMZONED mode
  btrfs: disable device replace in HMZONED mode
  btrfs: align extent allocation to zone boundary
  btrfs: do sequential allocation on HMZONED drives
  btrfs: split btrfs_map_bio()
  btrfs: introduce submit buffer
  btrfs: expire submit buffer on timeout
  btrfs: avoid sync IO prioritization on checksum in HMZONED mode
  btrfs: redirty released extent buffers in sequential BGs
  btrfs: reset zones of unused block groups
  btrfs: wait existing extents before truncating
  btrfs: enable to mount HMZONED incompat flag

 fs/btrfs/async-thread.c     |   1 +
 fs/btrfs/async-thread.h     |   1 +
 fs/btrfs/ctree.h            |  36 ++-
 fs/btrfs/dev-replace.c      |  10 +
 fs/btrfs/disk-io.c          |  48 +++-
 fs/btrfs/extent-tree.c      | 281 +++++++++++++++++-
 fs/btrfs/extent_io.c        |   1 +
 fs/btrfs/extent_io.h        |   1 +
 fs/btrfs/file.c             |   4 +
 fs/btrfs/free-space-cache.c |  36 +++
 fs/btrfs/free-space-cache.h |  10 +
 fs/btrfs/inode.c            |  14 +
 fs/btrfs/super.c            |  32 ++-
 fs/btrfs/sysfs.c            |   2 +
 fs/btrfs/transaction.c      |  32 +++
 fs/btrfs/transaction.h      |   3 +
 fs/btrfs/volumes.c          | 551 ++++++++++++++++++++++++++++++++++--
 fs/btrfs/volumes.h          |  37 +++
 include/uapi/linux/btrfs.h  |   1 +
 19 files changed, 1061 insertions(+), 40 deletions(-)

Comments

Hannes Reinecke Aug. 10, 2018, 7:04 a.m. UTC | #1
On 08/09/2018 08:04 PM, Naohiro Aota wrote:
> This series adds zoned block device support to btrfs.
> 
> [...]
> 
Thanks for doing all the work.
However, the patches don't apply cleanly to the current master branch.
Can you please rebase them?

Thanks.

Cheers,

Hannes
Hannes Reinecke Aug. 10, 2018, 7:26 a.m. UTC | #2
On 08/09/2018 08:04 PM, Naohiro Aota wrote:
> This series adds zoned block device support to btrfs.
> 
> [...]
> 
And unfortunately this series fails to boot for me:

BTRFS error (device nvme0n1p5): zoned devices mixed with regular devices
BTRFS error (device nvme0n1p5): failed to init hmzoned mode: -22
BTRFS error (device nvme0n1p5): open_ctree failed

Needless to say, /dev/nvme0n1p5 is _not_ a zoned device.
Nor does the zoned device have a btrfs superblock ATM.

Cheers,

Hannes
Qu Wenruo Aug. 10, 2018, 7:28 a.m. UTC | #3
On 8/10/18 2:04 AM, Naohiro Aota wrote:
> This series adds zoned block device support to btrfs.
> 
> A zoned block device consists of a number of zones. Zones are either
> conventional, accepting random writes, or sequential, requiring that
> writes be issued in LBA order starting from each zone's write pointer
> position.

Not familiar with zoned block devices, especially the sequential case.

Is the sequential case tape like?

> This
> patch series ensures that the sequential write constraint of sequential
> zones is respected while fundamentally not changing btrfs block and I/O
> management for blocks stored in conventional zones.
> 
> To achieve this, the default dev extent size of btrfs is changed on zoned
> block devices so that dev extents are always aligned to a zone. Allocation
> of blocks within a block group is changed so that the allocation is always
> sequential from the beginning of the block group. To do so, an allocation
> pointer is added to block groups and used as the allocation hint.  The
> allocation changes also ensure that blocks freed below the allocation
> pointer are ignored, resulting in sequential block allocation regardless of
> the block group usage.

This looks like it would cause a lot of holes for metadata block groups.
It would be better to avoid metadata block allocation in such sequential
zone.
(And that would need the infrastructure to make extent allocator
priority-aware)

> 
> [...]
> 
> Naohiro Aota (17):
>   btrfs: introduce HMZONED feature flag
>   btrfs: Get zone information of zoned block devices
>   btrfs: Check and enable HMZONED mode
>   btrfs: limit super block locations in HMZONED mode
>   btrfs: disable fallocate in HMZONED mode
>   btrfs: disable direct IO in HMZONED mode
>   btrfs: disable device replace in HMZONED mode
>   btrfs: align extent allocation to zone boundary

According to the patch name, I thought it's about extent allocation, but
in fact it's about dev extent allocation.
Renaming the patch would make more sense.

>   btrfs: do sequential allocation on HMZONED drives

And this is the patch modifying extent allocator.

Despite that, the zoned storage support looks pretty interesting and
has something in common with the planned priority-aware extent allocator.

Thanks,
Qu

Nikolay Borisov Aug. 10, 2018, 7:53 a.m. UTC | #4
On  9.08.2018 21:04, Naohiro Aota wrote:
> This series adds zoned block device support to btrfs.
> 
> [...]
> 

There are multiple places where you do naked shifts by
ilog2(sectorsize). There is a perfectly well named define, SECTOR_SHIFT,
which is a lot more informative for someone who doesn't necessarily have
experience with the linux storage/fs layers. Please fix such occurrences
of magic value shifting.
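
For instance, converting a byte offset to a 512-byte block layer sector
(an illustrative line, not taken from the patches):

	sector_t sector = byte_offset >> SECTOR_SHIFT;	/* SECTOR_SHIFT == 9 */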
Nikolay Borisov Aug. 10, 2018, 7:55 a.m. UTC | #5
On 10.08.2018 10:53, Nikolay Borisov wrote:
> 
> 
> On  9.08.2018 21:04, Naohiro Aota wrote:
>> This series adds zoned block device support to btrfs.
>>
>> [...]
>>
> 
> There are multiple places where you do naked shifts by
> ilog2(sectorsize). There is a perfectly well named define, SECTOR_SHIFT,
> which is a lot more informative for someone who doesn't necessarily have
> experience with the linux storage/fs layers. Please fix such occurrences
> of magic value shifting.
> 

And Hannes just reminded me that this landed in commit 233bde21aa43
("block: Move SECTOR_SIZE and SECTOR_SHIFT definitions into
<linux/blkdev.h>") this March, so it might be fairly recent depending
on the tree you've based your work on.
Hans van Kranenburg Aug. 10, 2018, 1:32 p.m. UTC | #6
On 08/10/2018 09:28 AM, Qu Wenruo wrote:
> 
> 
> On 8/10/18 2:04 AM, Naohiro Aota wrote:
>> This series adds zoned block device support to btrfs.
>>
>> [...]
> 
> And this is the patch modifying extent allocator.
> 
> Despite that, the zoned storage support looks pretty interesting and
> has something in common with the planned priority-aware extent allocator.

Priority-aware allocator? Is someone actually working on that, or is it
planned like everything is 'planned' (i.e. nice idea, and might happen
or might as well not happen ever, SIYH)?
Qu Wenruo Aug. 10, 2018, 2:04 p.m. UTC | #7
On 8/10/18 9:32 PM, Hans van Kranenburg wrote:
> On 08/10/2018 09:28 AM, Qu Wenruo wrote:
>>
>>
>> On 8/10/18 2:04 AM, Naohiro Aota wrote:
>>> This series adds zoned block device support to btrfs.
>>>
>>> [...]
>>
>> And this is the patch modifying extent allocator.
>>
>> Despite that, the zoned storage support looks pretty interesting and
>> has something in common with the planned priority-aware extent allocator.
> 
> Priority-aware allocator? Is someone actually working on that, or is it
> planned like everything is 'planned' (i.e. nice idea, and might happen
> or might as well not happen ever, SIYH)?

I'm working on this, although it will take some time.

It's originally designed to solve the problem where some empty
block groups won't be freed due to pinned bytes.

Thanks,
Qu
Naohiro Aota Aug. 10, 2018, 2:24 p.m. UTC | #8
On Fri, Aug 10, 2018 at 09:04:59AM +0200, Hannes Reinecke wrote:
> On 08/09/2018 08:04 PM, Naohiro Aota wrote:
> > This series adds zoned block device support to btrfs.
> > 
> > [...]
> > 
> Thanks for doing all the work.
> However, the patches don't apply cleanly to the current master branch.
> Can you please rebase them?

I'm currently basing on the
https://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git
for-next branch, since my previous bug-fix patch 266e010932ce ("btrfs:
revert fs_devices state on error of btrfs_init_new_device") is
necessary to avoid a use-after-free bug in the error handling path of
btrfs_init_new_device() in patch 2. I'm sorry for not mentioning
it.

I'll rebase on the master branch when the patch reaches master.

Regards,
Naohiro

David Sterba Aug. 13, 2018, 6:42 p.m. UTC | #9
On Fri, Aug 10, 2018 at 03:04:33AM +0900, Naohiro Aota wrote:
> This series adds zoned block device support to btrfs.

Yay, thanks!

As this is an RFC, I'll give you some. The code looks ok for what it claims
to do; I'll skip style and unimportant implementation details for now as
there are bigger questions.

The zoned devices bring some constraints, so not all filesystem features
can be expected to work. This rules out any form of in-place
updates like NODATACOW.

Then there's the list of 'how will zoned devices work with feature X'?

You disable fallocate and DIO. I haven't looked closer at the fallocate
case, but DIO could work in the sense that open() will open the file but
any write will fall back to buffered writes. This is implemented, so it
would just need to be wired together.

Mixed device types are not allowed, and I tend to agree with that,
though this could work in principle.  Just that the chunk allocator
would have to be aware of the device types and tweaked to allocate from
the same group. The btrfs code is not ready for that in terms of the
allocator capabilities and configuration options.

Device replace is disabled, but the changelog suggests there's a way to
make it work, so it's a matter of implementation. And this should be
implemented at the time of merge.

RAID5/6 + zoned support is highly desired and lack of it could be
considered a NAK for the whole series. The drive sizes are expected to
be several terabytes, which sounds too risky without the redundancy
options (RAID1 is not sufficient here).

The changelog does not explain why this does not or cannot work, so I
cannot reason about that or possibly suggest workarounds or solutions.
But I think it should work in principle.

As this is a first post and an RFC I don't expect that everything is
implemented, but at least the known missing points should be documented.
You've implemented lots of the low-level zoned support and extent
allocation, so even if raid56 might be difficult, it should be the
smaller part.
Hannes Reinecke Aug. 13, 2018, 7:20 p.m. UTC | #10
On 08/13/2018 08:42 PM, David Sterba wrote:
> On Fri, Aug 10, 2018 at 03:04:33AM +0900, Naohiro Aota wrote:
>> This series adds zoned block device support to btrfs.
> 
> Yay, thanks!
> 
> [...]
> 
> Device replace is disabled, but the changelog suggests there's a way to
> make it work, so it's a matter of implementation. And this should be
> implemented at the time of merge.
> 
How would a device replace work in general?
While I do understand that device replace is possible with RAID
thingies, I somewhat fail to see how one could do a device replacement
without RAID functionality.
Is it even possible?
If so, how would it be different from a simple umount?

> RAID5/6 + zoned support is highly desired and lack of it could be
> considered a NAK for the whole series. The drive sizes are expected to
> be several terabytes, which sounds too risky without the redundancy
> options (RAID1 is not sufficient here).
> 
That really depends on the allocator.
If we can make the RAID code work with zone-sized stripes it should
be pretty trivial. I can have a look at that; RAID support was on my
agenda anyway (albeit for MD, not for btrfs).

> The changelog does not explain why this does not or cannot work, so I
> cannot reason about that or possibly suggest workarounds or solutions.
> But I think it should work in principle.
> 
As mentioned, it really should work for zone-sized stripes. I'm not sure
we can make it work with stripes smaller than the zone size.

> As this is a first post and an RFC I don't expect that everything is
> implemented, but at least the known missing points should be documented.
> You've implemented lots of the low-level zoned support and extent
> allocation, so even if raid56 might be difficult, it should be the
> smaller part.
> 
FYI, I've run a simple stress-test on a zoned device (git clone linus &&
make) and haven't found any issues; compilation ran without a
problem, and with quite decent speed.
Good job!

Cheers,

Hannes
Austin S. Hemmelgarn Aug. 13, 2018, 7:29 p.m. UTC | #11
On 2018-08-13 15:20, Hannes Reinecke wrote:
> On 08/13/2018 08:42 PM, David Sterba wrote:
>> On Fri, Aug 10, 2018 at 03:04:33AM +0900, Naohiro Aota wrote:
>>> This series adds zoned block device support to btrfs.
>>
>> [...]
>>
>> Device replace is disabled, but the changelog suggests there's a way to
>> make it work, so it's a matter of implementation. And this should be
>> implemented at the time of merge.
>>
> How would a device replace work in general?
> While I do understand that device replace is possible with RAID
> thingies, I somewhat fail to see how one could do a device replacement
> without RAID functionality.
> Is it even possible?
> If so, how would it be different from a simple umount?
Device replace is implemented in largely the same manner as most other 
live data migration tools (for example, LVM2's pvmove command).

In short, when you issue a replace command for a given device, all
writes that would go to that device are instead sent to the new device.
While this is happening, old data is copied over from the old device to
the new one.  Once all the data is copied, the old device is released
(and its BTRFS signature wiped), and the new device has its device ID
updated to that of the old device.
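
A rough sketch of the write-duplication idea (simplified; the real logic
lives in the btrfs dev-replace and bio mapping code):

	/*
	 * Sketch: while a replace is running, any write mapped to the
	 * source device is also issued to the target device, so the
	 * target converges on a full copy of the source.
	 */
	static void map_write(struct btrfs_device *dev, struct bio *bio,
			      struct btrfs_device *srcdev,
			      struct btrfs_device *tgtdev)
	{
		submit_bio(bio);			/* the normal write */
		if (dev == srcdev && tgtdev) {
			struct bio *copy;

			copy = bio_clone_fast(bio, GFP_NOFS, NULL);
			bio_set_dev(copy, tgtdev->bdev);	/* duplicate it */
			submit_bio(copy);
		}
	}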

This is possible largely because of the COW infrastructure, but it's 
implemented in a way that doesn't entirely depend on it (otherwise it 
wouldn't work for NOCOW files).

Handling this on zoned devices is not likely to be easy though, you 
would functionally have to freeze I/O that would hit the device being 
replaced so that you don't accidentally write to a sequential zone out 
of order.
Hannes Reinecke Aug. 14, 2018, 7:41 a.m. UTC | #12
On 08/13/2018 09:29 PM, Austin S. Hemmelgarn wrote:
> On 2018-08-13 15:20, Hannes Reinecke wrote:
>> On 08/13/2018 08:42 PM, David Sterba wrote:
>>> On Fri, Aug 10, 2018 at 03:04:33AM +0900, Naohiro Aota wrote:
>>>> This series adds zoned block device support to btrfs.
>>>
>>> Yay, thanks!
>>>
[ .. ]
>>> Device replace is disabled, but the changelog suggests there's a way to
>>> make it work, so it's a matter of implementation. And this should be
>>> implemented at the time of merge.
>>>
>> How would a device replace work in general?
>> While I do understand that device replace is possible with RAID
>> thingies, I somewhat fail to see how one could do a device replacement
>> without RAID functionality.
>> Is it even possible?
>> If so, how would it be different from a simple umount?
> [...]
> 
> Handling this on zoned devices is not likely to be easy though, you
> would functionally have to freeze I/O that would hit the device being
> replaced so that you don't accidentally write to a sequential zone out
> of order.

Ah. Oh. Hmm.

It would be possible in principle if we freeze accesses to any partially
filled zones on the original device. Then all new writes will be going
into new/empty zones on the new disks, and we can copy over the old data
with no issue at all.
We end up with some partially filled zones on the new disk, but they
really should be cleaned up eventually either by the allocator filling
up the partially filled zones or once garbage collection clears out
stale zones.

However, I fear the required changes to the btrfs allocator are beyond
my btrfs knowledge :-(

Cheers,

Hannes
Austin S. Hemmelgarn Aug. 15, 2018, 11:25 a.m. UTC | #13
On 2018-08-14 03:41, Hannes Reinecke wrote:
> On 08/13/2018 09:29 PM, Austin S. Hemmelgarn wrote:
>> On 2018-08-13 15:20, Hannes Reinecke wrote:
>>> On 08/13/2018 08:42 PM, David Sterba wrote:
>>>> On Fri, Aug 10, 2018 at 03:04:33AM +0900, Naohiro Aota wrote:
>>>>> This series adds zoned block device support to btrfs.
>> [...]
>>
>> Handling this on zoned devices is not likely to be easy though, you
>> would functionally have to freeze I/O that would hit the device being
>> replaced so that you don't accidentally write to a sequential zone out
>> of order.
> 
> Ah. Oh. Hmm.
> 
> It would be possible in principle if we freeze accesses to any partially
> filled zones on the original device. Then all new writes will be going
> into new/empty zones on the new disks, and we can copy over the old data
> with no issue at all.
> We end up with some partially filled zones on the new disk, but they
> really should be cleaned up eventually either by the allocator filling
> up the partially filled zones or once garbage collection clears out
> stale zones.
> 
> However, I fear the required changes to the btrfs allocator are beyond
> my btrfs knowledge :-(
The easy short term solution is to just disallow the replace command 
(with the intent of getting it working in the future), but ensure that 
the older style add/remove method works.  That uses the balance code 
internally, so it should honor any restrictions on block placement for 
the new device, and therefore should be pretty easy to get working.
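
For example, migrating to a new device with the add/remove method (the
device paths are placeholders):

	# btrfs device add /dev/new-disk /mnt
	# btrfs device remove /dev/old-disk /mnt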
Naohiro Aota Aug. 16, 2018, 9:05 a.m. UTC | #14
On Fri, Aug 10, 2018 at 03:28:21PM +0800, Qu Wenruo wrote:
> 
> 
> On 8/10/18 2:04 AM, Naohiro Aota wrote:
> > This series adds zoned block device support to btrfs.
> > 
> > A zoned block device consists of a number of zones. Zones are either
> > conventional, accepting random writes, or sequential, requiring that
> > writes be issued in LBA order starting from each zone's write pointer
> > position.
> 
> Not familiar with zoned block devices, especially the sequential case.
> 
> Is the sequential case tape like?

It's somewhat similar but not the same as tape drives. On tape
drives, you still *can* write in random access patterns, though it's
much slower. In sequential write required zones, writes are always
enforced to be sequential within a zone. Violating the sequential
write rule results in an I/O error.

One user of sequential write required zones is Host-Managed "Shingled
Magnetic Recording" (SMR) HDDs [1]. They increase volume capacity
by overlapping the tracks. As a result, writing to a track overwrites
the adjacent tracks. This physical nature forces the sequential write
pattern.

[1] https://en.wikipedia.org/wiki/Shingled_magnetic_recording

> > [...]
> 
> This looks like it would cause a lot of holes for metadata block groups.
> It would be better to avoid metadata block allocation in such sequential
> zone.
> (And that would need the infrastructure to make extent allocator
> priority-aware)

Yes, it would introduce holes in metadata block groups. I agree it is
desirable to allocate metadata blocks from conventional
(non-sequential) zones.

However, it's sometimes impossible to allocate metadata blocks from
conventional zones, since the number of conventional zones is
generally much smaller than the number of sequential zones in some
zoned block devices like SMR HDDs (to achieve higher volume capacity).

While this patch series ensures metadata/data can be allocated in any
type of zone and everything works in any zones, we will be able to
improve metadata allocation by making the extent allocator
priority/zone-type aware in the future.

> > [...]
> > Naohiro Aota (17):
> >   btrfs: introduce HMZONED feature flag
> >   btrfs: Get zone information of zoned block devices
> >   btrfs: Check and enable HMZONED mode
> >   btrfs: limit super block locations in HMZONED mode
> >   btrfs: disable fallocate in HMZONED mode
> >   btrfs: disable direct IO in HMZONED mode
> >   btrfs: disable device replace in HMZONED mode
> >   btrfs: align extent allocation to zone boundary
> 
> According to the patch name, I thought it's about extent allocation, but
> in fact it's about dev extent allocation.
> Renaming the patch would make more sense.
>
> >   btrfs: do sequential allocation on HMZONED drives
> 
> And this is the patch modifying extent allocator.

Thanks. I will fix the names of the patches in the next version.

> Despite that, the zoned storage support looks pretty interesting and
> has something in common with the planned priority-aware extent allocator.
> 
> Thanks,
> Qu

Regards,
Naohiro
Naohiro Aota Aug. 28, 2018, 10:33 a.m. UTC | #15
Thank you for your review!

On Mon, Aug 13, 2018 at 08:42:52PM +0200, David Sterba wrote:
> On Fri, Aug 10, 2018 at 03:04:33AM +0900, Naohiro Aota wrote:
> > This series adds zoned block device support to btrfs.
> 
> Yay, thanks!
> 
> As this is an RFC, I'll give you some. The code looks ok for what it claims
> to do; I'll skip style and unimportant implementation details for now as
> there are bigger questions.
> 
> The zoned devices bring some constraints, so not all filesystem features
> can be expected to work. This rules out any form of in-place
> updates like NODATACOW.
> 
> Then there's the list of 'how will zoned devices work with feature X'?

Here is the current HMZONED status list based on https://btrfs.wiki.kernel.org/index.php/Status

Performance
Trim       | OK
Autodefrag | OK
Defrag     | OK
fallocate  | Disabled. cannot reserve region in sequential zones
direct IO  | Disabled. falling back to buffered IO

Compression | OK

Reliability
Auto-repair    | not working. need to rewrite the corrupted extent
Scrub          | not working. need to rewrite the corrupted extent
Scrub + RAID56 | not working (RAID56)
nodatacow      | should be disabled. (noticed it's not disabled now)
Device replace | disabled for now (need to handle write pointer issues, WIP patch)
Degraded mount | OK

Block group profile
Single   | OK
DUP      | OK
RAID0    | OK
RAID1    | OK
RAID10   | OK
RAID56   | Disabled for now. need to avoid partial parity write.
Mixed BG | OK

Administration | OK

Misc
Free space tree | Disabled. not necessary for sequential allocator
no-holes        | OK
skinny-metadata | OK
extended-refs   | OK

> You disable fallocate and DIO. I haven't looked closer at the fallocate
> case, but DIO could work in the sense that open() will open the file but
> any write will fall back to buffered writes. This is implemented, so it
> would just need to be wired together.

Actually, it's working like that. When check_direct_IO() returns
-EINVAL, btrfs_direct_IO() still returns 0. As a result, the callers
fall back to buffered IO.
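
A condensed sketch of that pattern (based on the shape of
btrfs_direct_IO() at the time; simplified, not the exact code):

	static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
	{
		struct inode *inode = file_inode(iocb->ki_filp);

		/*
		 * Returning 0 (not an error) from the direct IO hook makes
		 * the callers fall back to buffered IO.
		 */
		if (check_direct_IO(btrfs_sb(inode->i_sb), iter, iocb->ki_pos))
			return 0;

		/* ... the actual direct IO submission path follows ... */
		return submit_direct_IO(iocb, iter);	/* placeholder name */
	}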

I will reword the commit subject and log to reflect the actual
behavior. Also I will relax the condition to disable only direct write
IOs.

> Mixed device types are not allowed, and I tend to agree with that,
> though this could work in principle.  Just that the chunk allocator
> would have to be aware of the device types and tweaked to allocate from
> the same group. The btrfs code is not ready for that in terms of the
> allocator capabilities and configuration options.

Yes, it will work if the allocator is improved to be aware of the device
type, zone type and zone size.

> Device replace is disabled, but the changelog suggests there's a way to
> make it work, so it's a matter of implementation. And this should be
> implemented at the time of merge.

I have a WIP patch to support device replace, but it fails after
replacing a device due to a write pointer mismatch. I'm debugging the
code, so a following version may enable the feature.

> RAID5/6 + zoned support is highly desired and lack of it could be
> considered a NAK for the whole series. The drive sizes are expected to
> be several terabytes, which sounds too risky without the redundancy
> options (RAID1 is not sufficient here).
> 
> The changelog does not explain why this does not or cannot work, so I
> cannot reason about that or possibly suggest workarounds or solutions.
> But I think it should work in principle.
> 
> As this is a first post and an RFC I don't expect that everything is
> implemented, but at least the known missing points should be documented.
> You've implemented lots of the low-level zoned support and extent
> allocation, so even if raid56 might be difficult, it should be the
> smaller part.

I was leaving RAID56 for the future, since I'm not used to the raid56
code and its write path (raid56_parity_write) seems to be
separate from the others' (submit_stripe_bio).

I quickly checked whether RAID5 works on the current HMZONED patches. But
even with a simple sequential workload using dd, it produced IO failures
because partial parity writes introduced overwrite IOs, which violate the
sequential write rule. At a quick glance at the raid56 code, I'm
currently not sure how we can avoid partial parity writes while
dispatching the necessary IOs on transaction commit.

Regards,
Naohiro