[v2,00/19] btrfs zoned block device support

Message ID 20190607131025.31996-1-naohiro.aota@wdc.com (mailing list archive)

Message

Naohiro Aota June 7, 2019, 1:10 p.m. UTC
btrfs zoned block device support

This series adds zoned block device support to btrfs.

A zoned block device consists of a number of zones. Zones are either
conventional, accepting random writes, or sequential, requiring that
writes be issued in LBA order from each zone's write pointer position. This
patch series ensures that the sequential write constraint of sequential
zones is respected while leaving btrfs block and I/O management
fundamentally unchanged for blocks stored in conventional zones.

To achieve this, the default chunk size of btrfs is changed on zoned block
devices so that chunks are always aligned to a zone. Allocation of blocks
within a chunk is changed so that the allocation is always sequential from
the beginning of the chunks. To do so, an allocation pointer is added to
block groups and used as the allocation hint. The allocation changes also
ensure that blocks freed below the allocation pointer are ignored,
resulting in sequential block allocation regardless of the chunk usage.
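
As an illustration only (this is not code from the series), the allocation
pointer idea can be sketched in plain C. The structure, its field names and
the seq_alloc()/seq_free() helpers are hypothetical simplifications, not the
btrfs block group structure or its allocator:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified block group; not the btrfs structure. */
struct seq_block_group {
	uint64_t start;		/* first byte of the block group (zone aligned) */
	uint64_t length;	/* size of the block group in bytes */
	uint64_t alloc_offset;	/* allocation pointer, relative to start */
};

#define SEQ_ALLOC_FAILED UINT64_MAX

/*
 * Allocate len bytes at the allocation pointer and advance the pointer.
 * Returns the absolute start of the allocation, or SEQ_ALLOC_FAILED when
 * the space remaining past the pointer is too small.
 */
uint64_t seq_alloc(struct seq_block_group *bg, uint64_t len)
{
	uint64_t pos;

	assert(len > 0);
	if (len > bg->length - bg->alloc_offset)
		return SEQ_ALLOC_FAILED;

	pos = bg->start + bg->alloc_offset;
	bg->alloc_offset += len;
	return pos;
}

/*
 * Freeing never moves the allocation pointer backwards: space freed
 * below the pointer is ignored until the whole zone is reset.
 */
void seq_free(struct seq_block_group *bg, uint64_t pos, uint64_t len)
{
	(void)bg;
	(void)pos;
	(void)len;
}
```

In this sketch, a free followed by another allocation still returns the
next higher offset, which is the behavior the sequential write constraint
needs.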

While the introduction of the allocation pointer ensures that blocks will be
allocated sequentially, I/Os to write out newly allocated blocks may be
issued out of order, causing errors when writing to sequential zones. This
problem is solved by introducing a submit_buffer() function and changes to
the internal I/O scheduler to ensure that write I/Os for each chunk are
issued in order, matching the block allocation order in the chunk.
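
To make the reordering idea concrete, here is a minimal user-space sketch,
assuming a fixed-size hold queue per chunk. It is not the submit_buffer()
implementation from the patches: struct submit_buffer, submit_write() and
the issued[] log (which stands in for handing the I/O to the device) are
all hypothetical names chosen for the example:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PENDING	16
#define MAX_LOG		64

struct pending_write {
	uint64_t offset;
	uint64_t len;
};

/* Hypothetical per-chunk state; not the structure used by the series. */
struct submit_buffer {
	uint64_t next_offset;			/* next offset the zone accepts */
	struct pending_write held[MAX_PENDING];	/* out-of-order writes */
	size_t nr_held;
	uint64_t issued[MAX_LOG];		/* order in which writes went out */
	size_t nr_issued;
};

/* Stand-in for submitting the write to the device: record the order. */
static void issue_write(struct submit_buffer *sb, const struct pending_write *w)
{
	assert(w->offset == sb->next_offset);
	if (sb->nr_issued < MAX_LOG)
		sb->issued[sb->nr_issued++] = w->offset;
	sb->next_offset = w->offset + w->len;
}

/*
 * Submit one write. Returns 0 on success, or -1 if the write lies below
 * the current write position or the hold queue is full.
 */
int submit_write(struct submit_buffer *sb, uint64_t offset, uint64_t len)
{
	struct pending_write w = { offset, len };
	int progressed;
	size_t i;

	if (offset < sb->next_offset)
		return -1;

	if (offset > sb->next_offset) {
		/* Not issuable yet: hold it until the gap is filled. */
		if (sb->nr_held == MAX_PENDING)
			return -1;
		sb->held[sb->nr_held++] = w;
		return 0;
	}

	issue_write(sb, &w);

	/* Drain held writes that have become contiguous. */
	do {
		progressed = 0;
		for (i = 0; i < sb->nr_held; i++) {
			if (sb->held[i].offset == sb->next_offset) {
				issue_write(sb, &sb->held[i]);
				sb->held[i] = sb->held[--sb->nr_held];
				progressed = 1;
				break;
			}
		}
	} while (progressed);

	return 0;
}
```

Submitting the writes at offsets 8192, 4096 and 0 (4096 bytes each) in that
order leaves the first two held; the third one triggers all three to be
issued in ascending offset order.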

The zone of a chunk is reset to allow reuse of the zone only when the block
group is being freed, that is, when all the chunks of the block group are
unused.

For btrfs volumes composed of multiple zoned disks, restrictions are added
to ensure that all disks have the same zone size. This matches the existing
constraint that all chunks in a block group must have the same size.
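
The zone size restriction amounts to a simple uniformity check across the
devices of a volume. The following is a sketch under stated assumptions
only: struct zoned_dev_info and common_zone_size() are hypothetical names
and do not appear in the patches.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-device summary; the name and field are illustrative. */
struct zoned_dev_info {
	uint64_t zone_size;	/* zone size in bytes */
};

/*
 * Return the zone size shared by all devices, or 0 when the list is
 * empty or at least one device uses a different zone size.
 */
uint64_t common_zone_size(const struct zoned_dev_info *devs, size_t nr_devs)
{
	size_t i;

	assert(devs != NULL || nr_devs == 0);
	if (nr_devs == 0)
		return 0;

	for (i = 1; i < nr_devs; i++) {
		if (devs[i].zone_size != devs[0].zone_size)
			return 0;
	}
	return devs[0].zone_size;
}
```

A caller could refuse to assemble the volume when this returns 0 for a
non-empty device list.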

As discussed with Chris Mason at LSFMM, we enabled device replacing in
HMZONED mode. fallocate is still disabled for now.

Patch 1 introduces the HMZONED incompatible feature flag to indicate that
the btrfs volume was formatted for use on zoned block devices.

Patches 2 and 3 implement functions to gather information on the zones of
the device (zone types and write pointer positions).

Patches 4 and 5 disable features which are not compatible with the
sequential write constraints of zoned block devices. This includes
fallocate and direct I/O support.

Patches 6 and 7 tweak the extent buffer allocation for HMZONED mode to
implement sequential block allocation in block groups and chunks.

Patch 8 marks a block group read-only when the write pointers of the
devices composing it (e.g. a RAID1 block group) do not match.

Patch 9 restricts the possible locations of super blocks to conventional
zones to preserve the existing update in-place mechanism for the super
blocks.

Patches 10 to 12 implement the new submit buffer I/O path to ensure
sequential write I/O delivery to the device zones.

Patches 13 to 17 modify several parts of btrfs to handle free blocks
without breaking the sequential block allocation and sequential write order
as well as zone reset for unused chunks.

Patch 18 adds support for device replacing.

Finally, patch 19 adds the HMZONED feature to the list of supported
features.

This series applies on kdave/for-5.2-rc2.

Changelog
v2:
 - Add support for dev-replace
 -- To support dev-replace, moved submit_buffer one layer up. It now
    handles bio instead of btrfs_bio.
 -- Mark unmirrored Block Group readonly only when there are writable
    mirrored BGs. Necessary to handle degraded RAID.
 - Expire worker uses vanilla delayed_work instead of btrfs's async-thread
 - Device extent allocator now ensures that a region is on the same zone type.
 - Add delayed allocation shrinking.
 - Rename btrfs_drop_dev_zonetypes() to btrfs_destroy_dev_zonetypes
 - Fix
 -- Use SECTOR_SHIFT (Nikolay)
 -- Use btrfs_err (Nikolay)

Naohiro Aota (19):
  btrfs: introduce HMZONED feature flag
  btrfs: Get zone information of zoned block devices
  btrfs: Check and enable HMZONED mode
  btrfs: disable fallocate in HMZONED mode
  btrfs: disable direct IO in HMZONED mode
  btrfs: align dev extent allocation to zone boundary
  btrfs: do sequential extent allocation in HMZONED mode
  btrfs: make unmirroed BGs readonly only if we have at least one
    writable BG
  btrfs: limit super block locations in HMZONED mode
  btrfs: rename btrfs_map_bio()
  btrfs: introduce submit buffer
  btrfs: expire submit buffer on timeout
  btrfs: avoid sync IO prioritization on checksum in HMZONED mode
  btrfs: redirty released extent buffers in sequential BGs
  btrfs: reset zones of unused block groups
  btrfs: wait existing extents before truncating
  btrfs: shrink delayed allocation size in HMZONED mode
  btrfs: support dev-replace in HMZONED mode
  btrfs: enable to mount HMZONED incompat flag

 fs/btrfs/ctree.h             |  47 ++-
 fs/btrfs/dev-replace.c       | 103 ++++++
 fs/btrfs/disk-io.c           |  49 ++-
 fs/btrfs/disk-io.h           |   1 +
 fs/btrfs/extent-tree.c       | 479 +++++++++++++++++++++++-
 fs/btrfs/extent_io.c         |  28 ++
 fs/btrfs/extent_io.h         |   2 +
 fs/btrfs/file.c              |   4 +
 fs/btrfs/free-space-cache.c  |  33 ++
 fs/btrfs/free-space-cache.h  |   5 +
 fs/btrfs/inode.c             |  14 +
 fs/btrfs/scrub.c             | 171 +++++++++
 fs/btrfs/super.c             |  30 +-
 fs/btrfs/sysfs.c             |   2 +
 fs/btrfs/transaction.c       |  35 ++
 fs/btrfs/transaction.h       |   3 +
 fs/btrfs/volumes.c           | 684 ++++++++++++++++++++++++++++++++++-
 fs/btrfs/volumes.h           |  37 ++
 include/trace/events/btrfs.h |  43 +++
 include/uapi/linux/btrfs.h   |   1 +
 20 files changed, 1734 insertions(+), 37 deletions(-)

Comments

David Sterba June 12, 2019, 5:51 p.m. UTC | #1
On Fri, Jun 07, 2019 at 10:10:06PM +0900, Naohiro Aota wrote:
> btrfs zoned block device support
> 
> This series adds zoned block device support to btrfs.

The overall design sounds ok.

I skimmed through the patches and the biggest task I see is how to make
the hmzoned adjustments and branches less visible, ie. there are too
many if (hmzoned) { do something } standing out. But that's merely a
matter of wrappers and maybe an abstraction here and there.

How can I test the zoned devices backed by files (or regular disks)? I
searched for some concrete example eg. for qemu or dm-zoned, but closest
match was a text description in libzbc README that it's possible to
implement. All other howtos expect a real zoned device.

Merge target is 5.3 or later, we'll see how things will go. I'm
expecting that we might need some time to get feedback about the
usability as there's no previous work widely used that we can build on
top of.

Naohiro Aota June 13, 2019, 4:59 a.m. UTC | #2
On 2019/06/13 2:50, David Sterba wrote:
> On Fri, Jun 07, 2019 at 10:10:06PM +0900, Naohiro Aota wrote:
>> btrfs zoned block device support
>>
>> This series adds zoned block device support to btrfs.
> 
> The overall design sounds ok.
> 
> I skimmed through the patches and the biggest task I see is how to make
> the hmzoned adjustments and branches less visible, ie. there are too
> many if (hmzoned) { do something } standing out. But that's merely a
> matter of wrappers and maybe an abstraction here and there.

Sure. I'll add some more abstractions in the next version.

> How can I test the zoned devices backed by files (or regular disks)? I
> searched for some concrete example eg. for qemu or dm-zoned, but closest
> match was a text description in libzbc README that it's possible to
> implement. All other howtos expect a real zoned device.

You can use tcmu-runner [1] to create an emulated zoned device backed by
a regular file. Here is a setup how-to:
http://zonedstorage.io/projects/tcmu-runner/#compilation-and-installation

[1] https://github.com/open-iscsi/tcmu-runner

> Merge target is 5.3 or later, we'll see how things will go. I'm
> expecting that we might need some time to get feedback about the
> usability as there's no previous work widely used that we can build on
> top of.
>

David Sterba June 13, 2019, 1:46 p.m. UTC | #3
On Thu, Jun 13, 2019 at 04:59:23AM +0000, Naohiro Aota wrote:
> On 2019/06/13 2:50, David Sterba wrote:
> > On Fri, Jun 07, 2019 at 10:10:06PM +0900, Naohiro Aota wrote:
> >> btrfs zoned block device support
> >>
> >> This series adds zoned block device support to btrfs.
> > 
> > The overall design sounds ok.
> > 
> > I skimmed through the patches and the biggest task I see is how to make
> > the hmzoned adjustments and branches less visible, ie. there are too
> > many if (hmzoned) { do something } standing out. But that's merely a
> > matter of wrappers and maybe an abstraction here and there.
> 
> Sure. I'll add some more abstractions in the next version.

Ok, I'll reply to the patches with specific things.

> > How can I test the zoned devices backed by files (or regular disks)? I
> > searched for some concrete example eg. for qemu or dm-zoned, but closest
> > match was a text description in libzbc README that it's possible to
> > implement. All other howtos expect a real zoned device.
> 
> You can use tcmu-runner [1] to create an emulated zoned device backed by
> a regular file. Here is a setup how-to:
> http://zonedstorage.io/projects/tcmu-runner/#compilation-and-installation

That looks great, thanks. I wonder why there's no way to find that, all
I got were dead links to linux-iscsi.org or tutorials of targetcli that
were years old and not working.

Feeding the textual commands to targetcli is not exactly what I'd
expect for scripting, but at least it seems to work.

I tried to pass an emulated ZBC device on host to KVM guest (as a scsi
device) but lsscsi does not recognize it as a zoned device (just a
QEMU harddisk). So it seems the emulation must be done inside the VM.

Naohiro Aota June 14, 2019, 2:07 a.m. UTC | #4
On 2019/06/13 22:45, David Sterba wrote:
> On Thu, Jun 13, 2019 at 04:59:23AM +0000, Naohiro Aota wrote:
>> On 2019/06/13 2:50, David Sterba wrote:
>>> On Fri, Jun 07, 2019 at 10:10:06PM +0900, Naohiro Aota wrote:
>>> How can I test the zoned devices backed by files (or regular disks)? I
>>> searched for some concrete example eg. for qemu or dm-zoned, but closest
>>> match was a text description in libzbc README that it's possible to
>>> implement. All other howtos expect a real zoned device.
>>
>> You can use tcmu-runner [1] to create an emulated zoned device backed by
>> a regular file. Here is a setup how-to:
>> http://zonedstorage.io/projects/tcmu-runner/#compilation-and-installation
>
> That looks great, thanks. I wonder why there's no way to find that, all
> I got were dead links to linux-iscsi.org or tutorials of targetcli that
> were years old and not working.

Actually, this is quite a new site. ;-)

> Feeding the textual commands to targetcli is not exactly what I'd
> expect for scripting, but at least it seems to work.

You can use "targetcli <directory> <command> [<args> ...]" format, so
you can call e.g.

targetcli /backstores/user:zbc create name=foo size=10G cfgstring=model-HM/zsize-256/conv-1@/mnt/nvme/disk0.raw

> I tried to pass an emulated ZBC device on host to KVM guest (as a scsi
> device) but lsscsi does not recognize it as a zoned device (just a
> QEMU harddisk). So this seems the emulation must be done inside the VM.

Oops, QEMU hides the details.

In this case, you can try exposing the ZBC device via iSCSI.

On the host:
(after creating the ZBC backstores)
# sudo targetcli /iscsi create
Created target iqn.2003-01.org.linux-iscsi.naota-devel.x8664:sn.f4f308e4892c.
Created TPG 1.
Global pref auto_add_default_portal=true
Created default portal listening on all IPs (0.0.0.0), port 3260.
# TARGET="iqn.2003-01.org.linux-iscsi.naota-devel.x8664:sn.f4f308e4892c"

(WARN: Allow any node to connect without any auth)
# targetcli /iscsi/${TARGET}/tpg1 set attribute generate_node_acls=1
Parameter generate_node_acls is now '1'.
(or you can explicitly allow an initiator)
# TCMU_INITIATOR=iqn.2018-07....
# targetcli /iscsi/${TARGET}/tpg1/acls create ${TCMU_INITIATOR}

(for each backend)
# targetcli /iscsi/${TARGET}/tpg1/luns create /backstores/user:zbc/foo
Created LUN 0.

Then, you can login to the iSCSI on the KVM guest like:

# iscsiadm -m discovery -t st -p $HOST_IP
127.0.0.1:3260,1 iqn.2003-01.org.linux-iscsi.naota-devel.x8664:sn.f4f308e4892c
# iscsiadm -m node -l -T ${TARGET}
Logging in to [iface: default, target: iqn.2003-01.org.linux-iscsi.naota-devel.x8664:sn.f4f308e4892c, portal: 127.0.0.1,3260]
Login to [iface: default, target: iqn.2003-01.org.linux-iscsi.naota-devel.x8664:sn.f4f308e4892c, portal: 127.0.0.1,3260] successful.

Damien Le Moal June 17, 2019, 2:44 a.m. UTC | #5
David,

On 2019/06/13 22:45, David Sterba wrote:
> On Thu, Jun 13, 2019 at 04:59:23AM +0000, Naohiro Aota wrote:
>> On 2019/06/13 2:50, David Sterba wrote:
>>> On Fri, Jun 07, 2019 at 10:10:06PM +0900, Naohiro Aota wrote:
>>>> btrfs zoned block device support
>>>>
>>>> This series adds zoned block device support to btrfs.
>>>
>>> The overall design sounds ok.
>>>
>>> I skimmed through the patches and the biggest task I see is how to make
>>> the hmzoned adjustments and branches less visible, ie. there are too
>>> many if (hmzoned) { do something } standing out. But that's merely a
>>> matter of wrappers and maybe an abstraction here and there.
>>
>> Sure. I'll add some more abstractions in the next version.
> 
> Ok, I'll reply to the patches with specific things.
> 
>>> How can I test the zoned devices backed by files (or regular disks)? I
>>> searched for some concrete example eg. for qemu or dm-zoned, but closest
>>> match was a text description in libzbc README that it's possible to
>>> implement. All other howtos expect a real zoned device.
>>
>> You can use tcmu-runner [1] to create an emulated zoned device backed by
>> a regular file. Here is a setup how-to:
>> http://zonedstorage.io/projects/tcmu-runner/#compilation-and-installation
> 
> That looks great, thanks. I wonder why there's no way to find that, all
> I got were dead links to linux-iscsi.org or tutorials of targetcli that
> were years old and not working.

The site went online 4 days ago :) We will advertise it whenever we can. This is
intended to document all things "zoned block device" including Btrfs support,
when we get it finished :)

> 
> Feeding the textual commands to targetcli is not exactly what I'd
> expect for scripting, but at least it seems to work.

Yes, this is not exactly obvious, but that is how most automation with linux
iscsi is done.

> 
> I tried to pass an emulated ZBC device on host to KVM guest (as a scsi
> device) but lsscsi does not recognize it as a zoned device (just a
> QEMU harddisk). So this seems the emulation must be done inside the VM.
> 

What driver did you use for the drive? virtio block? I have not touched that
driver or the qemu side, so zoned block dev support is likely missing. I will add
it. That would be especially useful for testing with a real drive. In the case
of tcmu runner, the initiator can be started in the guest directly and the
target emulation done either in the guest if loopback is used, or on the host
using iscsi connection. The former is what we use all the time and so is well
tested. I have to admit that testing with iscsi is lacking... Will add that to
the todo list.

Best regards.